Yeah, looks like it's for function alignment padding. It's a pretty common thing...

pbsd · on Dec 1, 2014

You can do an unconditional jump every 1 or 2 cycles, depending on the chip, whereas no chip I know of can execute more than 4 nops per cycle. Twelve single-byte nops therefore need at least 3 cycles, so I would say the jump is probably marginally faster than 12 nops.

Smart toolchains will turn those 12 bytes into 2 multi-byte nops, e.g., a 9-byte one and a 3-byte one.
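
Roughly, that kind of split can be sketched like this in Python. This is only an illustration, not any particular assembler's code: the table holds the recommended multi-byte NOP encodings as I remember them from the Intel manual's NOP entry, and the names `LONG_NOPS` and `nop_padding` are made up for the example.

```python
# Recommended NOP encodings indexed by length in bytes (1..9).
# Index 0 is a placeholder so that LONG_NOPS[k] is the k-byte form.
LONG_NOPS = [
    b"",
    bytes([0x90]),
    bytes([0x66, 0x90]),
    bytes([0x0F, 0x1F, 0x00]),
    bytes([0x0F, 0x1F, 0x40, 0x00]),
    bytes([0x0F, 0x1F, 0x44, 0x00, 0x00]),
    bytes([0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00]),
    bytes([0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00]),
    bytes([0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
    bytes([0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
]


def nop_padding(n):
    """Return a list of NOP encodings whose lengths add up to n bytes.

    Greedy choice (an assumption of this sketch): take the longest
    available form first, then cover the remainder with a shorter one.
    """
    if n < 0:
        raise ValueError("padding length must be non-negative")
    chunks = []
    while n > 0:
        k = min(n, len(LONG_NOPS) - 1)  # longest form that still fits
        chunks.append(LONG_NOPS[k])
        n -= k
    return chunks


if __name__ == "__main__":
    # Show the split for the 12 bytes of padding discussed above.
    for chunk in nop_padding(12):
        print(len(chunk), " ".join(f"{b:02x}" for b in chunk))
```

With 12 bytes of padding the greedy choice picks the 9-byte form and then the 3-byte form, so the CPU decodes 2 instructions instead of 12.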

0x0 · on Dec 1, 2014

What does a 9-byte NOP look like?

pkhuong · on Dec 1, 2014

https://github.com/sbcl/sbcl/blob/master/src/compiler/x86-64... has

0x66 0x0f 0x1f 0x84 0x00 0x00 0x00 0x00 0x00

That's an operand-size override prefix (0x66), followed by the dedicated NOP opcode (0x0f 0x1f), and finally 6 bytes to encode an effective address with offset: a ModRM byte (0x84), a SIB byte (0x00), and a 4-byte displacement of zero.
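
To make that byte layout concrete, here's a rough Python sketch that pulls the fields apart. It only understands this one 9-byte form and is not a general x86 decoder; the names `NOP9` and `decode_long_nop` are made up for the example.

```python
import struct

# The 9-byte NOP quoted above.
NOP9 = bytes([0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00])


def decode_long_nop(code):
    """Split the 9-byte NOP into prefix, opcode, ModRM, SIB and disp32 fields."""
    if len(code) != 9 or code[0] != 0x66 or code[1:3] != b"\x0f\x1f":
        raise ValueError("not the 9-byte NOP form discussed here")
    modrm, sib = code[3], code[4]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    # mod == 2 means a 32-bit displacement follows; rm == 4 means a SIB byte follows.
    if mod != 2 or rm != 4:
        raise ValueError("unexpected ModRM byte for this form")
    (disp32,) = struct.unpack("<i", code[5:9])  # little-endian signed 32-bit
    return {
        "prefix": code[0],
        "opcode": code[1:3],
        "mod": mod,
        "reg": reg,
        "rm": rm,
        "scale": 1 << (sib >> 6),   # SIB scale factor: 1, 2, 4 or 8
        "index": (sib >> 3) & 7,    # register number 0 is eax/rax
        "base": sib & 7,
        "disp32": disp32,
    }


if __name__ == "__main__":
    for name, value in decode_long_nop(NOP9).items():
        print(name, value)
```

The index and base fields both come out as register 0 with scale 1 and displacement 0, i.e. an address of the form [eax+eax*1+0] that the instruction never actually reads.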

makomk · on Dec 2, 2014

Multi-byte nops have compatibility issues on some of the more obscure 32-bit x86 CPUs, unfortunately: https://sourceware.org/bugzilla/show_bug.cgi?id=13675

pkhuong · on Dec 2, 2014

Right… you have to check cpuid for the long nop feature. I believe 0x66 0x90 is compatible (but slow, I would expect) with older CPUs.

pbsd · on Dec 1, 2014

    nop word ptr [eax+eax+0] ; 66 0f 1f 84 00 00 00 00 00

danielweber · on Dec 2, 2014

Do JMPs invalidate caches? That was the story I was telling myself where 12 NOPs would be faster than JMPing; I don't know if it's true.

jlebar · on Dec 2, 2014

Loops can be implemented with JMPs, and it would be Very Bad if every iteration of a loop invalidated caches. (In fact, it would be Very Bad if just about anything common invalidated caches, given how important they are to modern CPU performance.)

pbsd · on Dec 2, 2014

I don't know what exactly you mean by that, but I'm going with no. Unconditional jumps do interact with the uops cache in recent Intel chips, but they do so by terminating the current uops cache line, which is generally desirable.

pkhuong · on Dec 1, 2014

Long nops would be marginally quicker.