Hacker News

As someone who's not familiar enough with LLMs to deduce why this is amazing or scary, would you kindly explain why this is so?


It's not. That's part of the uninformed AI hype train that will consistently be posting first on AI stuff for the next 6 months.

Right now the main bottleneck for LLM size is GPU memory (VRAM). Training requires much more VRAM than inference, which limits the ability for entities that aren't Google or OpenAI-scale to finetune models (aka do a little more training on your custom dataset).

The paper here suggests that one can actually finetune LLMs with inference-sized VRAM usage instead of training-sized VRAM usage. If true, it will be possible to fine-tune larger models on smaller (though still expensive) GPUs -- like a single 3090 instead of one or eight A100s. So, more people can create more customized models.
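As a rough back-of-the-envelope sketch of why that gap exists (the 7B parameter count and bytes-per-value figures below are illustrative assumptions, not numbers from the paper), compare the weights-only footprint of fp16 inference against a typical mixed-precision Adam setup:

```python
# Rough memory arithmetic for a hypothetical 7B-parameter model.
# Inference in fp16 stores only the weights; mixed-precision Adam training
# additionally stores fp16 gradients, fp32 master weights, and two fp32
# Adam moment buffers. Activations are ignored here, so this understates
# the real training footprint.
params = 7e9

inference_bytes = params * 2                   # fp16 weights only
training_bytes = params * (2 + 2 + 4 + 4 + 4)  # weights + grads + master + m + v

print(f"inference: {inference_bytes / 1e9:.0f} GB")  # 14 GB
print(f"training:  {training_bytes / 1e9:.0f} GB")   # 112 GB
print(f"ratio:     {training_bytes / inference_bytes:.0f}x")  # 8x
```

Even before counting activations, that is an 8x gap between fitting a model for inference and fitting it for Adam-based training, which is why "fine-tune with inference-level memory" is a notable claim.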


Inference can be done on CPU+RAM, but it is much slower (like tens of seconds per token). So reducing the amount of memory the model uses during training would also reduce the number of compute operations, potentially making CPU+RAM suitable for fine-tuning within a reasonable amount of time too. Basically, a 12x smaller GPU memory requirement also translates to 12x fewer compute operations (doing compute on a CPU allows less parallelism than on a GPU).

The paper doesn’t focus on CPU or GPU training time improvements - I’d assume there is no significant improvement in the GPU training case. For CPU it is logical to expect a 12x training speed improvement, but it is still too slow to be practically useful.


> For CPU it is logical to expect a 12x training speed improvement, but it is still too slow to be practically useful.

I don't see what you base this on. MeZO trades one back-propagation pass for another forward pass. Why would that be 12x faster? It's also clear the convergence rate is slower than plain SGD (never mind AdamW) by a factor proportional to the effective rank of the Hessian.
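For context, the forward-only trick being discussed is a SPSA-style two-point gradient estimate. A minimal sketch on a toy quadratic stand-in for the loss (the function names and hyperparameters here are illustrative, not the paper's code):

```python
import numpy as np

def loss(theta):
    # Toy stand-in for an expensive forward pass: a plain quadratic.
    return float(np.sum(theta ** 2))

def mezo_step(theta, lr, eps=1e-3, seed=0):
    # Two forward passes along a random direction z give a scalar
    # projected-gradient estimate; z is regenerated from its seed instead
    # of being stored, which is what keeps memory at inference level.
    z = np.random.default_rng(seed).standard_normal(theta.shape)
    proj_grad = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)
    z = np.random.default_rng(seed).standard_normal(theta.shape)  # regenerate, don't store
    return theta - lr * proj_grad * z

d = 64
theta = np.ones(d)
for step in range(2000):
    theta = mezo_step(theta, lr=1 / (2 * (d + 2)), seed=step)
print(loss(theta))
```

Each step costs two forward passes and stores no gradients, activations, or optimizer state. On this toy problem the loss falls far below its starting value of 64.0 over the run, but note the step size has to shrink roughly with the problem dimension, which is the convergence-rate penalty being discussed above.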


heyitsguay is correct, but in addition:

- The researchers didn't even explore quantization. In theory 4 bit quant would allow for training on even more modest hardware.

- Memory use aside, forward pass only is potentially a big training speed increase.

- This method is very amenable to decentralized training.
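On the quantization bullet: a minimal absmax 4-bit round-trip (purely illustrative; the paper does not do this, and real schemes like block-wise quantization are more involved) would look something like:

```python
import numpy as np

def quantize_4bit(x):
    # Absmax quantization: map floats onto the 15 signed integer levels
    # [-7, 7], storing one float scale per tensor plus 4 bits per value.
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step.
print(np.abs(w - w_hat).max() <= scale / 2)
```

Whether forward passes through such low-precision weights stay accurate enough for the two-point gradient estimate to work is exactly the open question the bullet points at.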

It scares me because it feels like a gateway to networked, self training LLMs on commodity hardware. I thought this was a long way away... now it doesn't feel so far away.


> - Memory use aside, forward pass only is potentially a big training speed increase.

Forward-pass only doesn't mean it's faster. It converges much slower than even SGD. It is a memory-time tradeoff, but that's it.
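A quick toy illustration of that tradeoff (dimension, step counts, and learning rates below are arbitrary choices for a quadratic, not tuned to the paper): with the same step budget, exact-gradient SGD lands far below the forward-only two-point estimate.

```python
import numpy as np

d, steps = 64, 300
rng = np.random.default_rng(0)

x_sgd = np.ones(d)
x_zo = np.ones(d)
for _ in range(steps):
    # SGD with the exact gradient of ||x||^2.
    x_sgd -= 0.1 * (2 * x_sgd)
    # Forward-only two-point estimate along a random direction z;
    # the stable step size shrinks with the dimension d.
    z = rng.standard_normal(d)
    g = (np.sum((x_zo + 1e-3 * z) ** 2) - np.sum((x_zo - 1e-3 * z) ** 2)) / 2e-3
    x_zo -= g * z / (2 * (d + 2))

print(np.sum(x_sgd ** 2), np.sum(x_zo ** 2))
```

After the same number of steps, the SGD loss is effectively zero while the forward-only loss is still visibly above it: per step it is cheaper on memory, but it needs on the order of d times more steps on this problem, which is the memory-time tradeoff in a nutshell.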



