
Thinking a little further about this, I believe using PSHUFB is the way to go, at least when the count is large, because we can do two iterations in essentially one go (I haven't tested the code; it's mostly a sketch):

  vmovdqa xmm0, [0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6] ; nibble -> sum of its two 2-bit fields
  vmovdqa xmm15, [0x0f, 0x0f, ..., 0x0f] ; low-nibble mask
  vmovdqu xmm7, [rdi]
  _loop_body:
  vpand   xmm8, xmm15, [rdi]   ; low nibbles of each byte
  vpsrlw  xmm9, xmm7, 4
  vpand   xmm9, xmm9, xmm15    ; high nibbles of each byte
  vpshufb xmm8, xmm0, xmm8     ; field sums for the low nibbles
  vpshufb xmm9, xmm0, xmm9     ; field sums for the high nibbles
  vpaddb  xmm8, xmm8, xmm9     ; per byte: sum of all four 2-bit fields

  vpshufb xmm7, xmm8, xmm8 ; since sum <= 12, we already have the next sum in the vector!
                           ; xmm7[0] = xmm8[xmm8[0]]
  vpaddb  xmm8, xmm8, xmm7 ; add it

  vpextrb eax, xmm8, 0     ; combined advance for both keys
  vmovdqu xmm7, [rdi + rax]
  add rdi, rax
  sub esi, 2               ; two iterations consumed
  jnz _loop_body
This is likely extendable to 32-byte vectors with AVX2; have not thought much about that case.
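For the scalar-minded, the nibble-lookup trick above boils down to this (a C model with hypothetical names; the table is the one loaded into xmm0):

```c
#include <stdint.h>

/* Sum of the two 2-bit fields in a nibble, i.e. the table loaded into
 * xmm0 above: index n = hi*4 + lo maps to hi + lo. */
static const uint8_t field_sum[16] = {
    0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
};

/* What the vpand/vpsrlw/vpshufb/vpaddb sequence computes for each byte
 * of the vector: the sum of all four 2-bit fields. */
static uint8_t byte_field_sum(uint8_t b)
{
    return field_sum[b & 0x0f] + field_sum[b >> 4];
}
```

PSHUFB applies a 16-entry lookup like this to all 16 lanes at once, which is why splitting each byte into two nibbles lets two shuffles cover the full 8-bit input.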


This is exciting, but I think I may have tricked you on a couple of details. The actual distance to the next 'key' is 'sum + 5': 00 represents a 1-byte encoding, not zero, so the minimum offset to the next 'key' is 5. Thus the maximum offset is actually 17, which means one can't depend on having two keys within a single 16-byte vector. I'm trying to figure out if there is some way to compensate for this without halving the performance.
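To spell out the arithmetic (a scalar C sketch; `key_stride` is a hypothetical name): each 2-bit field encodes length minus one, so the distance to the next key is 5 plus the field sum, ranging from 5 to 17:

```c
#include <stdint.h>

/* Distance from one key byte to the next: 1 byte for the key itself,
 * plus four values whose lengths are each 2-bit field + 1, i.e. 5 + sum. */
static int key_stride(uint8_t key)
{
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += (key >> (2 * i)) & 3;  /* each field is 0..3 */
    return 5 + sum;                   /* 5..17 */
}
```

Since the maximum stride (17) exceeds 16, a 16-byte vector starting at one key is not guaranteed to contain the following key.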


Oh, I missed that. That makes things trickier, but I think we can still get away with something like

  vmovdqu xmm7, [rdi + rax + 5 - 1]       ; bytes at offsets 4..19 past the key
  vpinsrb xmm7, xmm7, [rdi + rax + 0], 0  ; patch the key byte back into lane 0
without too much of a performance penalty. The offset adjustments can then be folded into the shuffle tables, so there should be no further significant performance loss.
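In scalar terms (a C model; `load_key_window` is a hypothetical name), the load-plus-insert does this:

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the vmovdqu + vpinsrb pair above: load 16 bytes
 * starting 4 past the key, then patch the key byte into lane 0. The
 * result holds the key plus the bytes at offsets 5..19, so the next
 * key (at offset 5..17) is always inside the vector. */
static void load_key_window(const uint8_t *p, uint8_t out[16])
{
    memcpy(out, p + 5 - 1, 16); /* lanes 0..15 = p[4..19] */
    out[0] = p[0];              /* lane 0 = the key byte  */
}
```

Skipping offsets 1..4 is safe because those bytes can never hold the next key, which is what buys back the range lost to the minimum stride of 5.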


Yes, I think that should work to guarantee two per vector. I hadn't previously considered trying to do that, and I appreciate the suggestion and the sketch. I think I have a slightly faster (7-cycle) approach doing one at a time, using a 64-bit register as a lookup for the sum of the middle two fields, but this has good promise. Especially if we can get one vector farther ahead, so that instead of having the vector reload on the critical path, the unused portion of the current vector and a preload can be 'slid' into place. Do you know if there is a good way to simulate a PALIGNR but with a non-immediate operand? This might get down to 9-10 cycles for two keys.


I have no idea how to simulate a variable PALIGNR on Intel chips without making the loop extremely slow.

On AMD (with XOP), it can be done using VPPERM, which can shuffle from 2 sources. We can do variable alignment like this:

  vpperm xmm0, xmm1, xmm2, [[0..31] + offset] ; selector bytes offset..offset+15 pick a 16-byte window from the xmm1:xmm2 pair
On second thought, we can possibly do something similar on Intel using 2 pshufb and a blend.
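One way that could look, as a scalar C model of PSHUFB's zero-when-bit-7-is-set behavior (all names hypothetical; a POR stands in for the blend):

```c
#include <stdint.h>

/* Scalar model of PSHUFB: a lane whose index has bit 7 set yields zero;
 * otherwise the low 4 bits select a source byte. */
static void pshufb(const uint8_t src[16], const uint8_t idx[16],
                   uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0f];
}

/* Variable PALIGNR: dst = a[off..15] followed by b[0..off-1], for
 * off in 0..15. For lane i the combined index is i + off: values
 * 0..15 select from a; 16..31 select from b. Subtracting 16 for the
 * b-side wraps the a-range indices to 0xF0.. (bit 7 set -> zeroed),
 * and setting 0x80 on the a-side masks the b-range lanes, so a plain
 * OR combines the two shuffles. */
static void var_palignr(const uint8_t a[16], const uint8_t b[16],
                        int off, uint8_t dst[16])
{
    uint8_t ia[16], ib[16], ta[16], tb[16];
    for (int i = 0; i < 16; i++) {
        int idx = i + off;
        ia[i] = (idx < 16) ? (uint8_t)idx : 0x80; /* zero the b-range lanes */
        ib[i] = (uint8_t)(idx - 16);              /* a-range wraps, MSB set */
    }
    pshufb(a, ia, ta);
    pshufb(b, ib, tb);
    for (int i = 0; i < 16; i++)
        dst[i] = ta[i] | tb[i];                   /* the 'blend' (a POR)    */
}
```

The catch, as noted below, is materializing the two index vectors from a runtime offset cheaply; the shuffles and OR themselves are trivial.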


On second thought, we can possibly do something similar on Intel using 2 pshufb and a blend.

I tried for a bit, but haven't figured out how to make that work. PSHUFB needs a different XMM operand for each 'rotate'. Loading this operand would take 6 cycles, and I haven't thought of a clever way of generating it in less.

I do greatly appreciate your help, though. Thanks!



