Are the software and drivers for networking LLMs across Strix Halo machines there yet? I was under the impression a few weeks ago that it was all in very early stages and terribly slow.
llama.cpp with rpc-server doesn't require much bandwidth during inference, though you do take a performance hit.
For example, with two Strix Halo machines you can get 17 or so tokens/s with MiniMax M2.1 at Q6. That's a 229B-parameter model with a ~10B active set (roughly 7.5 GB of weights read per token at Q6), so the theoretical maximum with 256 GB/s of memory bandwidth would be ~34 tokens/s, putting the networked setup at about half the single-box ceiling.
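To make the arithmetic explicit, here's a quick back-of-the-envelope sketch. The ~6 bits/weight for Q6 and the hidden size are my own assumptions for illustration, not numbers from the model card, and the RPC traffic estimate assumes a layer-wise split where only the activation vector crosses the wire per token:

```python
# Back-of-the-envelope numbers for the setup described above.
# Assumptions (mine, not from the thread): ~6 bits/weight for Q6,
# fp16 activations over the RPC link, and a hypothetical hidden size.

active_params = 10e9          # MiniMax M2.1 active parameter count
bits_per_weight = 6           # rough Q6 estimate
mem_bandwidth = 256e9         # Strix Halo memory bandwidth, bytes/s

# Bytes of weights read from memory per generated token.
bytes_per_token = active_params * bits_per_weight / 8   # 7.5e9 B

# Decode is memory-bound, so bandwidth / bytes-per-token bounds throughput.
max_tps = mem_bandwidth / bytes_per_token
print(f"theoretical max: {max_tps:.1f} tok/s")           # ~34.1

observed_tps = 17
print(f"efficiency: {observed_tps / max_tps:.0%}")       # ~50%

# Why the network link barely matters: with a layer-wise split, only the
# hidden-state activations cross the wire per token, not the weights.
hidden_size = 6144            # hypothetical value for illustration
act_bytes = hidden_size * 2   # fp16
print(f"per-token RPC traffic: ~{act_bytes / 1024:.0f} KiB "
      f"(~{act_bytes * observed_tps / 1024:.0f} KiB/s at {observed_tps} tok/s)")
```

On those assumptions the link carries on the order of hundreds of KiB/s during generation, which is why even plain gigabit Ethernet isn't the bottleneck; the ~50% hit comes from pipeline stalls, not bandwidth.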