
Thinking a little further about this, I believe using PSHUFB is the way to go, at least when the count is large, because we can do two iterations in essentially one go (I haven't tested the code; it's mostly a sketch):

  vmovdqa xmm0, [0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6] ; nibble -> sum of its two 2-bit fields
  vmovdqa xmm15, [0x0f, 0x0f, ..., 0x0f] ; low-nibble mask
  vmovdqu xmm7, [rdi]
  _loop_body:
  vpand   xmm8, xmm15, [rdi]   ; low nibbles of each byte
  vpsrlw  xmm9, xmm7, 4
  vpand   xmm9, xmm9, xmm15    ; high nibbles of each byte
  vpshufb xmm8, xmm0, xmm8     ; field sums for the low nibbles
  vpshufb xmm9, xmm0, xmm9     ; field sums for the high nibbles
  vpaddb  xmm8, xmm8, xmm9     ; per byte: sum of all four 2-bit fields

  vpshufb xmm7, xmm8, xmm8 ; since sum <= 12, we already have the next sum in the vector!
                           ; xmm7[0] = xmm8[xmm8[0]]
  vpaddb  xmm8, xmm8, xmm7 ; add it

  vpextrb eax, xmm8, 0     ; combined advance for both keys
  vmovdqu xmm7, [rdi + rax]
  add rdi, rax
  sub esi, 2               ; two iterations consumed
  jnz _loop_body
This is likely extendable to 32-byte vectors with AVX2; have not thought much about that case.
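For the scalar-minded, the nibble-lookup trick above boils down to this (a C model with hypothetical names; the table is the one loaded into xmm0):

```c
#include <stdint.h>

/* Sum of the two 2-bit fields in a nibble, i.e. the table loaded into
 * xmm0 above: index n = hi*4 + lo maps to hi + lo. */
static const uint8_t field_sum[16] = {
    0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
};

/* What the vpand/vpsrlw/vpshufb/vpaddb sequence computes for each byte
 * of the vector: the sum of all four 2-bit fields. */
static uint8_t byte_field_sum(uint8_t b)
{
    return field_sum[b & 0x0f] + field_sum[b >> 4];
}
```

PSHUFB applies a 16-entry lookup like this to all 16 lanes at once, which is why splitting each byte into two nibbles lets two shuffles cover the full 8-bit input.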


This is exciting, but I think I may have tricked you on a couple of details. The actual distance to the next 'key' is 'sum + 5': 00 represents a 1-byte encoding, not zero, so the minimum offset to the next 'key' is 5. Thus the maximum offset is actually 17, which means one can't depend on having two keys within a single 16-byte vector. I'm trying to figure out if there is some way to compensate for this without halving the performance.
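To spell out the arithmetic (a scalar C sketch; `key_stride` is a hypothetical name): each 2-bit field encodes length minus one, so the distance to the next key is 5 plus the field sum, ranging from 5 to 17:

```c
#include <stdint.h>

/* Distance from one key byte to the next: 1 byte for the key itself,
 * plus four values whose lengths are each 2-bit field + 1, i.e. 5 + sum. */
static int key_stride(uint8_t key)
{
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += (key >> (2 * i)) & 3;  /* each field is 0..3 */
    return 5 + sum;                   /* 5..17 */
}
```

Since the maximum stride (17) exceeds 16, a 16-byte vector starting at one key is not guaranteed to contain the following key.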


Oh, I missed that. That makes things trickier, but I think we can still get away with something like

  vmovdqu xmm7, [rdi + rax + 5 - 1]       ; bytes at offsets 4..19 past the key
  vpinsrb xmm7, xmm7, [rdi + rax + 0], 0  ; patch the key byte back into lane 0
without too much of a performance penalty. The offset adjustments can then be folded into the shuffle tables, so there should be no further significant performance loss.
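In scalar terms (a C model; `load_key_window` is a hypothetical name), the load-plus-insert does this:

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the vmovdqu + vpinsrb pair above: load 16 bytes
 * starting 4 past the key, then patch the key byte into lane 0. The
 * result holds the key plus the bytes at offsets 5..19, so the next
 * key (at offset 5..17) is always inside the vector. */
static void load_key_window(const uint8_t *p, uint8_t out[16])
{
    memcpy(out, p + 5 - 1, 16); /* lanes 0..15 = p[4..19] */
    out[0] = p[0];              /* lane 0 = the key byte  */
}
```

Skipping offsets 1..4 is safe because those bytes can never hold the next key, which is what buys back the range lost to the minimum stride of 5.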


Yes, I think that should work to guarantee two per vector. I hadn't previously considered trying to do that, and I appreciate the suggestion and the sketch. I think I have a slightly faster (7-cycle) approach doing one at a time, using a 64-bit register as a lookup for the sum of the middle two fields, but this has good promise. Especially if we can get one vector farther ahead, so that instead of having the vector reload on the critical path, the unused portion of the current vector and a preload can be 'slid' into place. Do you know if there is a good way to simulate a PALIGNR but with a non-immediate operand? This might get down to 9-10 cycles for two keys.


I have no idea how to simulate a variable PALIGNR on Intel chips without making the loop extremely slow.

On AMD (with XOP), it can be done using VPPERM, which can shuffle from 2 sources. We can do variable alignment like this:

  vpperm xmm0, xmm1, xmm2, [[0..31] + offset] ; selector bytes offset..offset+15 pick a 16-byte window from the xmm1:xmm2 pair
On second thought, we can possibly do something similar on Intel using 2 pshufb and a blend.
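One way that could look, as a scalar C model of PSHUFB's zero-when-bit-7-is-set behavior (all names hypothetical; a POR stands in for the blend):

```c
#include <stdint.h>

/* Scalar model of PSHUFB: a lane whose index has bit 7 set yields zero;
 * otherwise the low 4 bits select a source byte. */
static void pshufb(const uint8_t src[16], const uint8_t idx[16],
                   uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0f];
}

/* Variable PALIGNR: dst = a[off..15] followed by b[0..off-1], for
 * off in 0..15. For lane i the combined index is i + off: values
 * 0..15 select from a; 16..31 select from b. Subtracting 16 for the
 * b-side wraps the a-range indices to 0xF0.. (bit 7 set -> zeroed),
 * and setting 0x80 on the a-side masks the b-range lanes, so a plain
 * OR combines the two shuffles. */
static void var_palignr(const uint8_t a[16], const uint8_t b[16],
                        int off, uint8_t dst[16])
{
    uint8_t ia[16], ib[16], ta[16], tb[16];
    for (int i = 0; i < 16; i++) {
        int idx = i + off;
        ia[i] = (idx < 16) ? (uint8_t)idx : 0x80; /* zero the b-range lanes */
        ib[i] = (uint8_t)(idx - 16);              /* a-range wraps, MSB set */
    }
    pshufb(a, ia, ta);
    pshufb(b, ib, tb);
    for (int i = 0; i < 16; i++)
        dst[i] = ta[i] | tb[i];                   /* the 'blend' (a POR)    */
}
```

The catch, as noted below, is materializing the two index vectors from a runtime offset cheaply; the shuffles and OR themselves are trivial.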


On second thought, we can possibly do something similar on Intel using 2 pshufb and a blend.

I tried for a bit, but haven't figured out how to make that work. PSHUFB needs a different XMM operand for each 'rotate'. Loading this operand would take 6 cycles, and I haven't thought of a clever way of generating it in less.

I do greatly appreciate your help, though. Thanks!



