
Is the software/driver stack for networking LLMs across Strix Halo machines there yet? I was under the impression a few weeks ago that it was in very early stages and terribly slow.





llama.cpp with rpc-server doesn't require much bandwidth during inference, though there is some loss of performance.

For example, with two Strix Halo machines you can get around 17 tokens/s with MiniMax M2.1 at Q6. That's a 229B-parameter model with a ~10B active set (about 7.5GB at Q6). Since each generated token requires reading the active weights once, the theoretical maximum with 256GB/s of memory bandwidth would be 256/7.5 ≈ 34 tokens/s.
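Roughly, the setup looks like this (a sketch, not a tested config: the IP, port, and model filename are placeholders, and llama.cpp must be built with -DGGML_RPC=ON for the rpc-server binary to exist):

    # On the second machine: expose its memory/compute as an RPC backend
    rpc-server -H 0.0.0.0 -p 50052

    # On the first machine: run inference, offloading layers to the remote backend
    llama-cli -m MiniMax-M2.1-Q6_K.gguf --rpc 192.168.1.2:50052 -ngl 99

The --rpc flag takes a comma-separated list of host:port entries, so more machines can be added the same way. As I understand it, the weights are distributed to the backends up front, and only small intermediate activations cross the network per token, which is why inference doesn't need much bandwidth.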


Llama.cpp with its rpc-server.


