When it's in a good state I'll open source it, I am keeping track of what optimizations make the most impact, stuff like this:
### Diagnosing parallelism pathologies (L1)
*Grid occupancy:*
- `Grid_Size / Workgroup_Size >= CU count` (W7900 = 96, Strix Halo = 40)?
- < 0.3 = massively undersubscribed. Fix grid FIRST. Micro-optimization
will NOT help.
- 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure.
- 1.0-4.0 = healthy; micro-optimization can help.
*Within-block distribution:*
- Does the kernel do useful work across all threads, or is there an
`if (threadIdx.x == 0)` gate around a serial top-k, reduction, or
scan? For c=1 decode, many kernels can't grow the grid, but they can
always parallelize inside the block.
- `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a
strong secondary signal of the within-block pathology.
*Router top-k (within-block fix)*:
- Kernel: `qwen35_router_select_kernel` @ c=1 decode
- Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)`
gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays.
- Fix: warp-shuffle parallel argmax across the whole block + `__shared__`
top_vals buffer eliminating the spill.
- Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E.
### Diagnosing parallelism pathologies (L1)
*Grid occupancy:* - `Grid_Size / Workgroup_Size >= CU count` (W7900 = 96, Strix Halo = 40)? - < 0.3 = massively undersubscribed. Fix grid FIRST. Micro-optimization will NOT help. - 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure. - 1.0-4.0 = healthy; micro-optimization can help.
*Within-block distribution:* - Does the kernel do useful work across all threads, or is there an `if (threadIdx.x == 0)` gate around a serial top-k, reduction, or scan? For c=1 decode, many kernels can't grow the grid, but they can always parallelize inside the block. - `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a strong secondary signal of the within-block pathology.
*Router top-k (within-block fix)*: - Kernel: `qwen35_router_select_kernel` @ c=1 decode - Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)` gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays. - Fix: warp-shuffle parallel argmax across the whole block + `__shared__` top_vals buffer eliminating the spill. - Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E.