Today I would like to show you a micro-optimization I recently used more for the fun of it than for its real impact. It’s an interesting trick that you should never bother to use, nor worry about.
It starts from a piece of Kotlin code that looked a bit like this (the original version used named constants, but I replaced them with their actual values for clarity in this context):
fun bitsForSize(value: Int): Int {
    return when {
        value <= 0x01FFF -> 13
        value <= 0x07FFF -> 15
        value <= 0x0FFFF -> 16
        value <= 0x3FFFF -> 18
        else -> 0
    }
}
This function returns how many bits we need to encode a given value. The context for this is complicated and not the point of this article; what I would like to focus on are the magic numbers used in the function.
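To make the thresholds concrete: each magic number is the largest value that fits in the returned bit count. Here is a quick check (a Java transliteration of the Kotlin above, written for illustration):

```java
public class Thresholds {
    // Java transliteration of the Kotlin bitsForSize() above.
    static int bitsForSize(int value) {
        if (value <= 0x01FFF) return 13;
        if (value <= 0x07FFF) return 15;
        if (value <= 0x0FFFF) return 16;
        if (value <= 0x3FFFF) return 18;
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(bitsForSize(0x1FFF));  // 13: 0x1FFF is the largest 13-bit value
        System.out.println(bitsForSize(0x2000));  // 15: one more needs a 14th bit, next bucket is 15
        System.out.println(bitsForSize(0x3FFFF)); // 18: largest 18-bit value
        System.out.println(bitsForSize(0x40000)); // 0: out of range
    }
}
```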
Here is the result of the compilation to aarch64 by ART:
mov w16, #0x1fff
cmp w1, w16
b.gt #+0xc (addr 0x1024)
mov w0, #0xd
b #+0x40 (addr 0x1060)
mov w16, #0x7fff
cmp w1, w16
b.gt #+0xc (addr 0x1038)
mov w0, #0xf
b #+0x2c (addr 0x1060)
mov w2, #0x12
mov w0, #0x10
mov w16, #0xffff
cmp w1, w16
cset w3, gt
mov w16, #0x3ffff
cmp w1, w16
csel w2, w2, wzr, le
cmp w3, #0x0 (0)
csel w0, w2, w0, ne
ret
There’s nothing particularly interesting about it, but if you look closely you can see that our magic constants are loaded directly into the w16 register (for instance in mov w16, #0x3ffff).
We can slightly improve this code by doing something counter-intuitive and adding an extra step before our tests:
fun bitsForSize(value: Int): Int {
    val bits = value.countLeadingZeroBits()
    return when {
        bits >= 32 - 13 -> 13
        bits >= 32 - 15 -> 15
        bits >= 32 - 16 -> 16
        bits >= 32 - 18 -> 18
        else -> 0
    }
}
Instead of testing the value directly, we count the number of leading zeroes in the binary representation of value. If you look at the binary representation of all the integers between 0 and 0x01FFF, you can observe that there will always be at least 19 (that is, 32 - 13) leading zero bits. A similar observation holds for the other magic constants in the original code.
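On the JVM, Kotlin's countLeadingZeroBits() is implemented with Integer.numberOfLeadingZeros(), so we can verify the observation directly (a small Java check, not from the original article):

```java
public class LeadingZeros {
    public static void main(String[] args) {
        // Claim: for every non-negative value, value <= 0x1FFF  <=>  nlz(value) >= 19.
        // Exhaustively check a range straddling the threshold.
        for (int value = 0; value <= 0x4000; value++) {
            boolean small = value <= 0x1FFF;
            boolean manyZeros = Integer.numberOfLeadingZeros(value) >= 32 - 13;
            if (small != manyZeros) throw new AssertionError("mismatch at " + value);
        }
        System.out.println(Integer.numberOfLeadingZeros(0x1FFF)); // 19
        System.out.println(Integer.numberOfLeadingZeros(0x2000)); // 18
    }
}
```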
Now that we are counting bits, the constants in our tests are much smaller numbers. Instead of 0x3FFFF for instance (262,143), we now use 14. This helps because when the constants are small enough, the compiler can encode them directly inside the comparison (cmp) instruction: the aarch64 cmp immediate form only has room for a 12-bit value.
Each of the branches in the original function produced aarch64 code of this form:
mov w16, #0x1fff
cmp w1, w16
b.gt #+0xc (addr 0x1024)
We first load the constant into register w16, then compare it with register w1, and finally branch.
With our new function however, each branch now looks like this:
cmp w0, #0x11 (17)
b.lt #+0xc (addr 0x1114)
We directly compare a constant (here 17) with register w0, and then branch. By using smaller constants we were able to eliminate a mov per branch, at the cost of an extra clz (count leading zeroes) at the top of the function:
clz w0, w1
cmp w0, #0x13 (19)
b.lt #+0xc (addr 0x1104)
mov w0, #0xd
b #+0x34 (addr 0x1134)
cmp w0, #0x11 (17)
b.lt #+0xc (addr 0x1114)
mov w0, #0xf
b #+0x24 (addr 0x1134)
mov w2, #0x10
mov w1, #0x12
cmp w0, #0x10 (16)
cset w3, lt
cmp w0, #0xe (14)
csel w1, w1, wzr, ge
cmp w3, #0x0 (0)
csel w0, w1, w2, ne
ret
Since the clz takes the place of one of the original movs, it is a net win as long as we have more than one branch (and the more branches, the better). The original function compiled down to 21 aarch64 instructions, and the new version down to 18.
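As a sanity check, the two versions really are equivalent for every non-negative input (a Java transliteration, with countLeadingZeroBits() replaced by its JVM equivalent, Integer.numberOfLeadingZeros(); note they diverge for negative inputs, where nlz is 0):

```java
public class Equivalence {
    // Original version: compare the value against the magic constants.
    static int bitsForSize(int value) {
        if (value <= 0x01FFF) return 13;
        if (value <= 0x07FFF) return 15;
        if (value <= 0x0FFFF) return 16;
        if (value <= 0x3FFFF) return 18;
        return 0;
    }

    // clz-based version: compare the leading-zero count against small constants.
    static int bitsForSizeClz(int value) {
        int bits = Integer.numberOfLeadingZeros(value);
        if (bits >= 32 - 13) return 13;
        if (bits >= 32 - 15) return 15;
        if (bits >= 32 - 16) return 16;
        if (bits >= 32 - 18) return 18;
        return 0;
    }

    public static void main(String[] args) {
        // [0, 0x50000) covers every threshold, including the fall-through to 0.
        for (int value = 0; value < 0x50000; value++) {
            if (bitsForSize(value) != bitsForSizeClz(value)) {
                throw new AssertionError("mismatch at " + value);
            }
        }
        System.out.println("equivalent on [0, 0x50000)");
    }
}
```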
Does this matter in terms of performance? On a modern device like a Pixel 8 the difference is about 0.5% in this exact function. On older or lower-power devices, the difference is much greater: I measured it at around 11% on a Pixel 2.
Note that the original function needed a single mov per constant, but that doesn’t mean it would always be the case. For instance, if we used the constant 0x1_FFFF_0003_FFFEL, the compiled code would be:
movn x16, #0x1
movk x16, #0x3, lsl #16
movk x16, #0x1, lsl #48
It’s interesting that the compiler “patches” two 16-bit words in a larger constant where all bits but one are already set (the movn materializes ~1, i.e. 0xFFFF_FFFF_FFFF_FFFE). This allows the compiler to minimize the number of instructions. With the constant 0x5555_5555_5555_5556L, the compiler cannot use this trick and must instead issue 4 instructions:
mov x16, #0x5556
movk x16, #0x5555, lsl #16
movk x16, #0x5555, lsl #32
movk x16, #0x5555, lsl #48
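A simplified model of what the compiler is doing: split the 64-bit constant into four 16-bit chunks, pick the cheaper starting point (all-zeros via movz, or all-ones via movn), and patch every remaining chunk with a movk. The sketch below (my own illustration, not ART’s actual algorithm, which has more tricks such as bitmask immediates) counts the resulting instructions:

```java
public class MovCost {
    // Estimated instruction count for materializing a 64-bit constant on aarch64:
    // start from movz (remaining chunks are 0x0000) or movn (remaining chunks are
    // 0xFFFF), then one movk per chunk that still differs. Upper-bound model only.
    static int movCost(long constant) {
        int zeroChunks = 0, onesChunks = 0;
        for (int shift = 0; shift < 64; shift += 16) {
            int chunk = (int) ((constant >>> shift) & 0xFFFF);
            if (chunk == 0x0000) zeroChunks++;
            if (chunk == 0xFFFF) onesChunks++;
        }
        int viaMovz = Math.max(1, 4 - zeroChunks); // at least one mov even for 0
        int viaMovn = Math.max(1, 4 - onesChunks); // at least one movn even for -1
        return Math.min(viaMovz, viaMovn);
    }

    public static void main(String[] args) {
        System.out.println(movCost(0x1FFFL));                // 1: a single mov
        System.out.println(movCost(0x1_FFFF_0003_FFFEL));    // 3: movn plus two movk
        System.out.println(movCost(0x5555_5555_5555_5556L)); // 4: mov plus three movk
    }
}
```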
So now you know that smaller constants can be cheaper, for some extremely low-level and most likely irrelevant definition of cheaper.