Today I would like to show you a micro-optimization I recently used more for the fun of it than for its real impact. It’s an interesting trick that you should never bother to use, nor worry about.
It starts from a piece of Kotlin code that looked a bit like this (the original version used named constants, but I replaced them with their actual values for clarity in this context):
fun bitsForSize(value: Int): Int {
    return when {
        value <= 0x01FFF -> 13
        value <= 0x07FFF -> 15
        value <= 0x0FFFF -> 16
        value <= 0x3FFFF -> 18
        else -> 0
    }
}
This function returns how many bits we need to encode a given value. The context for this is complicated and not the point of this article; what I would like to focus on are the magic numbers used in the function.
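To make the thresholds concrete: each magic number is the largest value that fits in the returned bit count. Here is a quick check (a Java transliteration of the Kotlin above, written for illustration):

```java
public class Thresholds {
    // Java transliteration of the Kotlin bitsForSize() above.
    static int bitsForSize(int value) {
        if (value <= 0x01FFF) return 13;
        if (value <= 0x07FFF) return 15;
        if (value <= 0x0FFFF) return 16;
        if (value <= 0x3FFFF) return 18;
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(bitsForSize(0x1FFF));  // 13: 0x1FFF is the largest 13-bit value
        System.out.println(bitsForSize(0x2000));  // 15: one more needs a 14th bit, next bucket is 15
        System.out.println(bitsForSize(0x3FFFF)); // 18: largest 18-bit value
        System.out.println(bitsForSize(0x40000)); // 0: out of range
    }
}
```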
Here is the result of the compilation to aarch64 by ART:
mov w16, #0x1fff
cmp w1, w16
b.gt #+0xc (addr 0x1024)
mov w0, #0xd
b #+0x40 (addr 0x1060)
mov w16, #0x7fff
cmp w1, w16
b.gt #+0xc (addr 0x1038)
mov w0, #0xf
b #+0x2c (addr 0x1060)
mov w2, #0x12
mov w0, #0x10
mov w16, #0xffff
cmp w1, w16
cset w3, gt
mov w16, #0x3ffff
cmp w1, w16
csel w2, w2, wzr, le
cmp w3, #0x0 (0)
csel w0, w2, w0, ne
ret
There’s nothing particularly interesting about it, but if you look closely you can see that our magic constants are loaded directly into the w16 register (for instance in mov w16, #0x3ffff).
We can slightly improve this code by doing something counter-intuitive and adding an extra step before our tests:
fun bitsForSize(value: Int): Int {
    val bits = value.countLeadingZeroBits()
    return when {
        bits >= 32 - 13 -> 13
        bits >= 32 - 15 -> 15
        bits >= 32 - 16 -> 16
        bits >= 32 - 18 -> 18
        else -> 0
    }
}
Instead of testing the value directly, we count the number of leading zeroes in the binary representation of value. If you look at the binary representation of all the integers between 0 and 0x01FFF, you can observe that there will always be at least 19 (that is, 32 - 13) leading zero bits. A similar observation holds for the other magic constants in the original code.
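On the JVM, Kotlin's countLeadingZeroBits() is implemented with Integer.numberOfLeadingZeros(), so we can verify the observation directly (a small Java check, not from the original article):

```java
public class LeadingZeros {
    public static void main(String[] args) {
        // Claim: for every non-negative value, value <= 0x1FFF  <=>  nlz(value) >= 19.
        // Exhaustively check a range straddling the threshold.
        for (int value = 0; value <= 0x4000; value++) {
            boolean small = value <= 0x1FFF;
            boolean manyZeros = Integer.numberOfLeadingZeros(value) >= 32 - 13;
            if (small != manyZeros) throw new AssertionError("mismatch at " + value);
        }
        System.out.println(Integer.numberOfLeadingZeros(0x1FFF)); // 19
        System.out.println(Integer.numberOfLeadingZeros(0x2000)); // 18
    }
}
```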
Now that we are counting bits, the constants in our tests are much smaller numbers. Instead of 0x3FFFF for instance (262,143), we now use 14. This helps because when the constants are small enough, the compiler can encode them directly inside the comparison (cmp) instruction: the aarch64 cmp immediate form only has room for a 12-bit value.
Each of the branches in the original function produced aarch64 code of this form:
mov w16, #0x1fff
cmp w1, w16
b.gt #+0xc (addr 0x1024)
We first load the constant into register w16, then compare it with register w1, and finally branch.
With our new function however, each branch now looks like this:
cmp w0, #0x11 (17)
b.lt #+0xc (addr 0x1114)
We directly compare a constant (here 17) with register w0, and then branch. By using smaller constants we were able to eliminate a mov per branch, at the cost of an extra clz (count leading zeroes) at the top of the function:
clz w0, w1
cmp w0, #0x13 (19)
b.lt #+0xc (addr 0x1104)
mov w0, #0xd
b #+0x34 (addr 0x1134)
cmp w0, #0x11 (17)
b.lt #+0xc (addr 0x1114)
mov w0, #0xf
b #+0x24 (addr 0x1134)
mov w2, #0x10
mov w1, #0x12
cmp w0, #0x10 (16)
cset w3, lt
cmp w0, #0xe (14)
csel w1, w1, wzr, ge
cmp w3, #0x0 (0)
csel w0, w1, w2, ne
ret
Since the clz takes the place of one of the original movs, it is a net win as long as we have more than one branch (and the more branches, the better). The original function compiled down to 21 aarch64 instructions, and the new version down to 18.
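As a sanity check, the two versions really are equivalent for every non-negative input (a Java transliteration, with countLeadingZeroBits() replaced by its JVM equivalent, Integer.numberOfLeadingZeros(); note they diverge for negative inputs, where nlz is 0):

```java
public class Equivalence {
    // Original version: compare the value against the magic constants.
    static int bitsForSize(int value) {
        if (value <= 0x01FFF) return 13;
        if (value <= 0x07FFF) return 15;
        if (value <= 0x0FFFF) return 16;
        if (value <= 0x3FFFF) return 18;
        return 0;
    }

    // clz-based version: compare the leading-zero count against small constants.
    static int bitsForSizeClz(int value) {
        int bits = Integer.numberOfLeadingZeros(value);
        if (bits >= 32 - 13) return 13;
        if (bits >= 32 - 15) return 15;
        if (bits >= 32 - 16) return 16;
        if (bits >= 32 - 18) return 18;
        return 0;
    }

    public static void main(String[] args) {
        // [0, 0x50000) covers every threshold, including the fall-through to 0.
        for (int value = 0; value < 0x50000; value++) {
            if (bitsForSize(value) != bitsForSizeClz(value)) {
                throw new AssertionError("mismatch at " + value);
            }
        }
        System.out.println("equivalent on [0, 0x50000)");
    }
}
```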
Does this matter in terms of performance? On a modern device like a Pixel 8 the difference is about 0.5% in this exact function. On older or lower-power devices, the difference is much greater: I measured it at around 11% on a Pixel 2.
Note that the original function needed a single mov per constant, but that doesn’t mean it would always be the case. For instance, if we used the constant 0x1_FFFF_0003_FFFEL, the compiled code would be:
movn x16, #0x1
movk x16, #0x3, lsl #16
movk x16, #0x1, lsl #48
It’s interesting that the compiler “patches” two 16-bit words in a larger constant where all bits but one are already set (the movn materializes ~1, i.e. 0xFFFF_FFFF_FFFF_FFFE). This allows the compiler to minimize the number of instructions. With the constant 0x5555_5555_5555_5556L, the compiler cannot use this trick and must instead issue 4 instructions:
mov x16, #0x5556
movk x16, #0x5555, lsl #16
movk x16, #0x5555, lsl #32
movk x16, #0x5555, lsl #48
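A simplified model of what the compiler is doing: split the 64-bit constant into four 16-bit chunks, pick the cheaper starting point (all-zeros via movz, or all-ones via movn), and patch every remaining chunk with a movk. The sketch below (my own illustration, not ART’s actual algorithm, which has more tricks such as bitmask immediates) counts the resulting instructions:

```java
public class MovCost {
    // Estimated instruction count for materializing a 64-bit constant on aarch64:
    // start from movz (remaining chunks are 0x0000) or movn (remaining chunks are
    // 0xFFFF), then one movk per chunk that still differs. Upper-bound model only.
    static int movCost(long constant) {
        int zeroChunks = 0, onesChunks = 0;
        for (int shift = 0; shift < 64; shift += 16) {
            int chunk = (int) ((constant >>> shift) & 0xFFFF);
            if (chunk == 0x0000) zeroChunks++;
            if (chunk == 0xFFFF) onesChunks++;
        }
        int viaMovz = Math.max(1, 4 - zeroChunks); // at least one mov even for 0
        int viaMovn = Math.max(1, 4 - onesChunks); // at least one movn even for -1
        return Math.min(viaMovz, viaMovn);
    }

    public static void main(String[] args) {
        System.out.println(movCost(0x1FFFL));                // 1: a single mov
        System.out.println(movCost(0x1_FFFF_0003_FFFEL));    // 3: movn plus two movk
        System.out.println(movCost(0x5555_5555_5555_5556L)); // 4: mov plus three movk
    }
}
```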
So now you know that smaller constants can be cheaper, for some extremely low-level and most likely irrelevant definition of cheaper.