It's also easy to run the 120b on CPU if you have the resources. I had the 120b running on my home LLM CPU inference box in just as long as it took to download the GGUFs, git pull and rebuild llama-server.
I had it running at 40t/s with zero effort and 50t/s after a bit of tweaking.
It's just too bad that even the 120b isn't really worth running compared to the other models that are out there.
It really is amazing what ggerganov and the llama.cpp team have done to democratize LLMs for individuals that can't afford a massive GPU farm worth more than the average annual salary.
2xEPYC Genoa w/768GB of DDR5-4800 and an A5000 24GB card.
I built it in January 2024 for about $6k and have thoroughly enjoyed running every new model as it gets released. Some of the best money I’ve ever spent.
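For anyone wondering why a box like this manages usable speeds, here is a rough back-of-the-envelope sketch. It assumes token generation is limited by memory bandwidth, that each socket has 12 DDR5 channels populated, and that the model only reads its active weights per token. The active-parameter count and bits-per-weight below are placeholder assumptions, not figures from this thread, and real throughput will land well under this ceiling.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (channels x MT/s x bytes per transfer)."""
    return channels * mega_transfers * 1e6 * bus_bytes / 1e9


def decode_ceiling_tps(bandwidth_gbs: float, active_params_billions: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token


# Assumption: 12 populated DDR5-4800 channels on one socket.
per_socket = peak_bandwidth_gbs(channels=12, mega_transfers=4800)
print(f"per-socket peak: {per_socket:.1f} GB/s")  # 460.8 GB/s

# Assumption: ~5B active parameters at ~4.5 bits per weight (illustrative only).
ceiling = decode_ceiling_tps(per_socket, active_params_billions=5.0, bits_per_weight=4.5)
print(f"bandwidth-bound ceiling: {ceiling:.0f} t/s")
```

The second socket does not simply double this, since cross-socket (NUMA) traffic eats into the total, so treat the single-socket number as the more honest reference point.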
I've seen some mentions of pure-CPU setups being successful for large models using old EPYC/Xeon workstations off eBay with 40+ cores. Interesting approach!
Wow, that's not bad. It's strange: for me it is much, much slower on a Radeon Pro VII (also 16GB, with a memory bandwidth of 1TB/s!) and a Ryzen 5 5600, also with 64GB. It's basically unworkably slow. Also, I only get 100% CPU when I check ollama ps, so the GPU is not being used at all :( It's also counterproductive because the model is just too large for 64GB.
I wonder what makes it work so well on yours! My CPU isn't much slower and my GPU probably faster.
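A quick way to sanity-check the "too large for 64GB" part is to estimate the weight footprint from parameter count and quantization width. This is only a sketch: the 4.25 bits per weight and the 4GB overhead allowance (KV cache, OS, runtime buffers) are assumptions for illustration, and the function names are made up.

```python
def approx_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billions * bits_per_weight / 8


def fits(weights_gb: float, ram_gb: float, vram_gb: float = 0.0,
         overhead_gb: float = 4.0) -> bool:
    """True if weights plus an assumed overhead fit in combined RAM and VRAM."""
    return weights_gb + overhead_gb <= ram_gb + vram_gb


weights = approx_weights_gb(params_billions=120, bits_per_weight=4.25)
print(f"weights: {weights:.2f} GB")                                # 63.75 GB
print("RAM only (64GB):", fits(weights, ram_gb=64))                # False
print("RAM + 16GB VRAM:", fits(weights, ram_gb=64, vram_gb=16))    # True
```

Under those assumptions the model does not fit in 64GB of system RAM alone, so if the GPU is not picking up any layers the machine will be paging, which would explain the unworkable speed.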
AMD basically decided they wanted to focus on HPC and data center customers rather than consumers, and so GPGPU driver support for consumer cards has been non-existent or terrible[1].
The Radeon Pro VII is not a consumer card though, and it works well with ROCm. It even has datacenter "grade" HBM2 memory that most Nvidia cards don't have. Ongoing support has been dropped, but ROCm of course still works fine. It's nearly as fast in Ollama as my 4090 (which I don't use for AI regularly, I just play with it sometimes).