Another very interesting characteristic of the ARM instruction set is that MOV and AND each have an opposite op (MVN and BIC).
That means you can write MOV r0, #-1 (equivalent to #0xFFFFFFFF) and the assembler will emit an MVN r0, #0 to fulfill your needs. This increases the number of constants that can be directly assigned.
MVN and BIC are not really "opposite ops": they are the same operations, but applied to the bitwise NOT of the operand, which is easy to implement in hardware. It's true this also allows for better code density.
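To make the MOV/MVN substitution concrete, here is a rough sketch of the check an assembler could perform. It assumes the classic A32 rule (an 8-bit value rotated right by an even amount); the function names are made up for illustration and are not taken from any real assembler.

```python
MASK32 = 0xFFFFFFFF

def rol32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    value &= MASK32
    return ((value << amount) | (value >> (32 - amount))) & MASK32

def a32_encodable(value):
    """True if `value` is an 8-bit constant rotated right by an even amount."""
    # Undo every possible even rotation and see whether only 8 bits remain.
    return any(rol32(value, rot) <= 0xFF for rot in range(0, 32, 2))

def pick_move(value):
    """Return (mnemonic, immediate) for a one-instruction load, or None."""
    value &= MASK32
    if a32_encodable(value):
        return ("MOV", value)
    inverted = ~value & MASK32          # MVN writes the bitwise NOT of its operand
    if a32_encodable(inverted):
        return ("MVN", inverted)
    return None                         # needs a literal pool or several instructions

print(pick_move(-1))           # ('MVN', 0)
print(pick_move(0xFF000000))   # ('MOV', 4278190080)
print(pick_move(0x101))        # None
```

The same trick would apply to AND/BIC: a mask such as 0xFFFFFF00 is not encodable, but its complement 0xFF is, so AND with the mask can become BIC with #0xFF.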
Just wondering, would it be better to normalise like floating point does and always include a virtual 1 on the front of the number? This encoding wastes space because some numbers can be encoded multiple ways e.g. 3 and 6.
On the other hand, that would mean zero couldn't be encoded, but you don't need a zero for most maths as it doesn't do anything (and you could use a zero register.)
Edit to add: the even shift is a problem, but you could use 7 bits of number, 1 implicit leading 1 bit, and 5 bits to cover any 32-bit shift.
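As a rough way to compare the two ideas, the sketch below enumerates both 12-bit schemes. The "normalised" variant is only my reading of the proposal above (7 stored bits, an implicit leading 1, and a 5-bit rotate rather than a plain shift), so treat it as an assumption.

```python
MASK32 = 0xFFFFFFFF

def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32

# Existing A32 scheme: 8-bit value, 4-bit field giving an even rotation.
a32 = {}
for imm8 in range(256):
    for rot in range(16):
        a32.setdefault(ror32(imm8, 2 * rot), []).append((imm8, rot))

# Proposed scheme (assumed): implicit leading 1 + 7 bits, any rotation 0..31.
normalised = {ror32(0x80 | imm7, rot)
              for imm7 in range(128) for rot in range(32)}

print(len(a32), len(normalised))     # 3073 4096
print(len(a32[3]), len(a32[6]))      # 4 3  (redundant encodings of 3 and 6)
print(0 in a32, 0 in normalised)     # True False
```

Under these assumptions the normalised form covers every nonzero value the existing scheme does, plus about a thousand more, at the cost of losing zero.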
I doubt it'll be as easy as it was for 32-bit ARM because AArch64 has no (well, only a few) predicated instructions. Instead, the 64-bit ISA effectively uses those bits to name additional registers.
AArch64 already has similar code density to x86-64 anyway.
Wow, this is very cool and I had no idea it was part of the architecture. It's from 2014, but still news to me!
I find myself using (apparently unnecessary) 'LDR' instructions because of this sort of thing all the time. And that adds up, in Cortex-M chips with kilobytes of program memory.
Note that the Cortex-M chips implement only the Thumb/Thumb2 encodings. Those have different rules for encoding immediates -- the rules this article describes are for the 32-bit ARM (A32) encoding.
I did some off-the-cuff analysis of some random code a couple of years ago and found the traditional ARM constant encoding to be not as good - for that code, at least - as the MIPS-/POWER-style load low/or high business. So if my code is at all representative I'm not surprised they went for MOVW+MOVT in the end: https://news.ycombinator.com/item?id=11607119#11608650
Regarding the 00XY00XY and XY00XY00 forms: I'm going to have to sleep on this, but this is a bit of mystery to me so far. I'm going to have to keep an eye out for those sorts of constants now! Maybe they're actually quite common?? - I suppose this will certainly let you form many kinds of mask useful for SIMD operations.
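For anyone wanting to hunt for such constants, here is a small sketch of the Thumb-2 modified-immediate rule as I understand it: a plain byte, one of the three replicated-byte forms, or an 8-bit value with its top bit set rotated right by 8 to 31. The function names are invented for this example.

```python
MASK32 = 0xFFFFFFFF

def rol32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    value &= MASK32
    return ((value << amount) | (value >> (32 - amount))) & MASK32

def thumb2_encodable(value):
    """True if `value` fits a Thumb-2 modified immediate (sketch)."""
    v = value & MASK32
    if v <= 0xFF:                       # 000000XY
        return True
    low = v & 0xFF
    if v == (low << 16) | low:          # 00XY00XY
        return True
    high = (v >> 8) & 0xFF
    if v == (high << 24) | (high << 8): # XY00XY00
        return True
    if v == low * 0x01010101:           # XYXYXYXY
        return True
    # Otherwise: 1bbbbbbb rotated right by 8..31 (undo the rotation to test).
    return any(0x80 <= rol32(v, rot) <= 0xFF for rot in range(8, 32))

for constant in (0x00FF00FF, 0xFF00FF00, 0xABABABAB, 0x3FC, 0x101):
    print(hex(constant), thumb2_encodable(constant))
```

Note that 0x00FF00FF passes here even though it cannot be written as an A32 rotated 8-bit immediate, which is the kind of mask the replicated forms seem aimed at.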
PowerPC has addis (add immediate shifted) though, which, as the name implies, shifts the immediate value 16 bits to the left before the add. In the worst case, you can add any arbitrary 32-bit value to any register with two opcodes.
So, is it the case that with ARM, you _may_ be able to do the same add in one op, but its worst case appears to be 4 ops for the full range of immediate values?
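One way to see where the figure of four comes from is to split the constant greedily into pieces that each fit a rotated 8-bit immediate. This is only an illustrative sketch (the helper name is made up, and the greedy split is not guaranteed to be minimal), not what any particular compiler does.

```python
MASK32 = 0xFFFFFFFF

def split_into_chunks(value):
    """Split `value` into pieces that each fit an A32 rotated 8-bit immediate."""
    value &= MASK32
    chunks = []
    while value:
        lowest = (value & -value).bit_length() - 1  # index of the lowest set bit
        start = lowest & ~1                         # round down to an even position
        chunk = value & (0xFF << start) & MASK32    # take up to 8 bits from there
        chunks.append(chunk)
        value &= ~chunk                             # clear the bits just consumed
    return chunks

for constant in (0x101, 0x12345678, 0xFFFFFFFF):
    print(hex(constant), [hex(c) for c in split_into_chunks(constant)])
```

Each piece starts at least 8 bits above the previous one, so there can never be more than four: one MOV (or ADD) for the first piece and one ADD/ORR for each of the rest.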
For nonrepresentable constants, a compiler will usually generate a constant pool and emit a single PC-relative load instruction to materialize the value. Many assemblers have directives to do this automatically.
In ARMv7 and previous, the program counter is an ordinary register, so you can perform a load relative to the PC for anything that doesn't fit into an instruction's immediate field. Compilers typically emit so-called literal pools before the entry or after the exit point of each function (or sometimes before a basic block inside very large functions). Some compilers support literal pool merging as a code size reduction technique, too.
Edit: Also, ARMv8 has dedicated instructions for PC-relative loads and jumps.
On ARM1156, and ARMv7 and newer, use MOVW, which takes a simple 16-bit immediate. If 16 bits doesn’t do it, use a MOVW/MOVT pair which can load any 32-bit value.
On older ARMs, either do a load immediate followed by an immediate arithmetic, like so:
MOV R0, #0x100
ADD R0, R0, #0x01
Or use a PC relative load:
LDR R0, [PC, offset where you stored 0x101]
Which the assembler allows you to abbreviate as:
LDR R0, =0x101
And it will find a spot for 0x101 and generate the correct PC relative load.
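For the curious, the offset the assembler has to compute can be sketched like this. It assumes A32 state, where reading PC yields the instruction's address plus 8 and the load reaches at most 4095 bytes either way; the addresses and the function name are hypothetical.

```python
def literal_load_offset(instr_addr, literal_addr):
    """Offset for an A32 `LDR Rd, [PC, #offset]` that reaches `literal_addr`."""
    pc = instr_addr + 8              # in ARM state, PC reads two instructions ahead
    offset = literal_addr - pc
    if not -4095 <= offset <= 4095:  # 12-bit magnitude plus an add/subtract bit
        raise ValueError("literal out of range; the pool must be placed closer")
    return offset

# Hypothetical layout: the load sits at 0x8000, the word 0x101 at 0x8010.
print(literal_load_offset(0x8000, 0x8010))   # 8
```

That range limit is why pools end up scattered near the code that uses them rather than collected in one place.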
Modern ARM actually has a 16-bit immediate version of MOV, plus MOVT which is the same but moves to the top 16 bits of a register. With a pair of those, you can load any 32-bit value in 2 instructions, without polluting the data cache like you would with LDR [pc+offset].
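A minimal sketch of how a constant splits across the pair, assuming R0 as the destination; the helper name is invented for illustration.

```python
def movw_movt(value):
    """Return the MOVW/MOVT sequence that loads a 32-bit `value` into R0."""
    value &= 0xFFFFFFFF
    low, high = value & 0xFFFF, value >> 16
    ops = [f"MOVW R0, #0x{low:04X}"]          # writes low half, zeroes the top half
    if high:
        ops.append(f"MOVT R0, #0x{high:04X}")  # writes top half, keeps the low half
    return ops

print(movw_movt(0x101))        # ['MOVW R0, #0x0101']
print(movw_movt(0x12345678))   # ['MOVW R0, #0x5678', 'MOVT R0, #0x1234']
```

Because MOVW clears the upper half, the MOVT can be dropped whenever the constant fits in 16 bits.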