
Is the software/driver stack for networking LLMs across Strix Halo machines there yet? I was under the impression a few weeks ago that it was in very early stages and terribly slow.





llama.cpp with rpc-server doesn't require much bandwidth during inference, though there is some loss of performance.

For example, with two Strix Halo machines you can get around 17 tokens/s with MiniMax M2.1 at Q6. That's a 229B-parameter model with a ~10B active set (about 7.5GB at Q6). Since each generated token requires reading the active weights once, the theoretical maximum with 256GB/s of memory bandwidth would be 256/7.5 ≈ 34 tokens/s.
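Roughly, the setup looks like this (a sketch, not a tested config: the IP, port, and model filename are placeholders, and llama.cpp must be built with -DGGML_RPC=ON for the rpc-server binary to exist):

    # On the second machine: expose its memory/compute as an RPC backend
    rpc-server -H 0.0.0.0 -p 50052

    # On the first machine: run inference, offloading layers to the remote backend
    llama-cli -m MiniMax-M2.1-Q6_K.gguf --rpc 192.168.1.2:50052 -ngl 99

The --rpc flag takes a comma-separated list of host:port entries, so more machines can be added the same way. As I understand it, the weights are distributed to the backends up front, and only small intermediate activations cross the network per token, which is why inference doesn't need much bandwidth.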


Llama.cpp with its rpc-server.


