
1MB of context can maybe hold 10 tokens depending on your model.

For reference: Llama 3.2 8B used to take about 4 KiB per token per layer. At 32 layers, that is 128 KiB per token, or 8 tokens per MiB of KV cache (context). If your context holds 8000 tokens, including responses, you need around 1 GB.
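The 4 KiB figure can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming a Llama-style grouped-query attention layout (8 KV heads of dimension 128, fp16 weights) — the head counts here are illustrative assumptions, not quoted from the comment:

```python
# KV cache size per token, per layer: keys + values for each KV head.
KV_HEADS = 8    # assumed number of key/value heads (GQA)
HEAD_DIM = 128  # assumed per-head dimension
BYTES = 2       # fp16
LAYERS = 32

per_token_per_layer = 2 * KV_HEADS * HEAD_DIM * BYTES  # K and V
print(per_token_per_layer)        # 4096 bytes = 4 KiB, as stated

per_token = per_token_per_layer * LAYERS               # 128 KiB per token
tokens_per_mib = (1024 * 1024) // per_token
print(tokens_per_mib)             # 8 tokens per MiB of KV cache

total_gib = 8000 * per_token / (1024 ** 3)
print(round(total_gib, 2))        # ~0.98 GiB for an 8000-token context
```

Under these assumptions the numbers line up with the comment: 4 KiB/token/layer, 8 tokens per MiB, and roughly 1 GB for an 8000-token context.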

> Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.

Matrix-vector multiplication implies a single floating-point multiply and add (2 flops) per parameter. Your GPU can do far more flops than that without using tensor cores at all. In fact, this workload bores your GPU to death.


