Optimizing code can be a difficult task because there are so many traps you need to avoid at every step of the way. Today I want to focus on one of the (numerous) benchmarking traps, which you may have run into, and that I myself encounter regularly.
Let’s imagine you are trying to optimize code, and you notice the use of value.pow(2f). One obvious way to optimize this is to replace the function call with a multiplication (value * value), but is it worth it? Since you are a diligent engineer, you decide to write a microbenchmark to compare before and after:
@RunWith(AndroidJUnit4::class)
class MathBenchmark {
    @get:Rule
    val benchmarkRule = BenchmarkRule()

    val data = FloatArray(8_192) {
        it.toFloat() / 3f
    }

    @Test
    fun pow2() {
        benchmarkRule.measureRepeated {
            for (f in data) {
                f.pow(2f)
            }
        }
    }

    @Test
    fun square() {
        benchmarkRule.measureRepeated {
            for (f in data) {
                f * f
            }
        }
    }
}
You then run your benchmark on a Pixel 6 and get the following results:
11,673 ns 0 allocs Trace MathBenchmark.square
11,681 ns 0 allocs Trace MathBenchmark.pow2
Problem solved, pow(2f) is just as fast as a multiplication, let’s move on… But wait, are we measuring what we think we are measuring? Even if pow(2f) is implemented efficiently, I would expect it to use at least a few more instructions than a single multiplication, and the difference should be visible in our measurements.
Let’s take a look at the compiled dex code:
void MathBenchmark.pow2()
  -- 12 instructions
  0000: iget-object v0, v8, LExampleBenchmark;.data:[F // field@0000
  0002: array-length v1, v0
  0003: const/4 v2, #int 0 // #0
  0004: if-ge v2, v1, 0012 // +000e
  0006: aget v3, v0, v2
  0008: float-to-double v4, v3
  0009: const/high16 v6, #int 1073741824 // #4000
  000b: float-to-double v6, v6
  000c: invoke-static {v4, v5, v6, v7}, Ljava/lang/Math;.pow:(DD)D // method@0004
  000f: add-int/lit8 v2, v2, #int 1 // #01
  0011: goto 0004 // -000d
  0012: return-void

void MathBenchmark.square()
  -- 9 instructions
  0000: iget-object v0, v4, LExampleBenchmark;.data:[F // field@0000
  0002: array-length v1, v0
  0003: const/4 v2, #int 0 // #0
  0004: if-ge v2, v1, 000c // +0008
  0006: aget v3, v0, v2
  0008: nop // spacer
  0009: add-int/lit8 v2, v2, #int 1 // #01
  000b: goto 0004 // -0007
  000c: return-void
The pow(2f) case makes a function call, whereas the other method doesn’t do anything. There is simply an empty loop; our multiplication is gone. The compiler optimized the operation away since it has no side effect and is not necessary for the program. It kept the call to pow() because a function call that’s not inlined can have unknown side effects.
We have two problems now:
- Why are both implementations just as fast if one doesn’t do anything?
- Can we bring back our multiplication?
Let’s start with the second problem so we can re-run our measurements on equal footing. We need to find a way to prevent the compiler from throwing away our multiplication. A simple solution is to do this:
@Test
fun pow2() {
    var result = 0f
    benchmarkRule.measureRepeated {
        for (f in data) {
            result += f.pow(2f)
        }
    }
}

@Test
fun square() {
    var result = 0f
    benchmarkRule.measureRepeated {
        for (f in data) {
            result += f * f
        }
    }
}
The compiled dex code now contains our multiplication:
void MathBenchmark.pow2()
  -- 16 instructions
  0000: const/4 v0, #int 0 // #0
  0001: iget-object v1, v9, LExampleBenchmark;.data:[F // field@0000
  0003: array-length v2, v1
  0004: const/4 v3, #int 0 // #0
  0005: if-ge v3, v2, 0016 // +0011
  0007: aget v4, v1, v3
  0009: float-to-double v5, v4
  000a: const/high16 v7, #int 1073741824 // #4000
  000c: float-to-double v7, v7
  000d: invoke-static {v5, v6, v7, v8}, Ljava/lang/Math;.pow:(DD)D // method@0004
  0010: move-result-wide v5
  0011: double-to-float v5, v5
  0012: add-float/2addr v0, v5
  0013: add-int/lit8 v3, v3, #int 1 // #01
  0015: goto 0005 // -0010
  0016: return-void

void MathBenchmark.square()
  -- 11 instructions
  0000: const/4 v0, #int 0 // #0
  0001: iget-object v1, v6, LExampleBenchmark;.data:[F // field@0000
  0003: array-length v2, v1
  0004: const/4 v3, #int 0 // #0
  0005: if-ge v3, v2, 000f // +000a
  0007: aget v4, v1, v3
  0009: mul-float v5, v4, v4
  000b: add-float/2addr v0, v5
  000c: add-int/lit8 v3, v3, #int 1 // #01
  000e: goto 0005 // -0009
  000f: return-void
We are now doing a proper comparison, so let’s run the benchmark again:
11,728 ns 0 allocs Trace MathBenchmark.square
11,729 ns 0 allocs Trace MathBenchmark.pow2
Same result. So is pow(2f) really as fast as a multiplication? Does ART have a magic intrinsic that turns the function call into a single multiplication? Let’s go a level deeper and inspect the aarch64 assembly generated by ART:
void MathBenchmark.pow2()
  -- 7 instructions (28 bytes)
  0x000010b0: str x0, [sp, #-32]!
  0x000010b4: str lr, [sp, #24]
  0x000010b8: ldr w0, [x1, #8]
  0x000010bc: ldr wzr, [x0]
  0x000010c0: ldr lr, [sp, #24]
  0x000010c4: add sp, sp, #0x20 (32)
  0x000010c8: ret

void MathBenchmark.square()
  -- 5 instructions (20 bytes)
  0x000010d0: stp x0, lr, [sp, #-16]!
  0x000010d4: ldr w0, [x1, #8]
  0x000010d8: ldr wzr, [x0]
  0x000010dc: ldp xzr, lr, [sp], #16
  0x000010e0: ret
All of our code is gone! ART is able to see that even with an accumulator, the work we are doing has no side effect, so it decided to get rid of it, and it can do this for the pow(2f) case as well.
To properly measure what we are trying to measure, we need to ensure a side effect. An easy way to do this, for instance, is to print our accumulator outside of the measurement loop:
@Test
fun square() {
    var result = 0f
    benchmarkRule.measureRepeated {
        for (f in data) {
            result += f * f
        }
    }
    println(result)
}
The trick with injecting side effects is to find a way to do so without affecting the benchmark itself, at least not too much. For instance, it would be a bad idea to accumulate the results in a mutableListOf<Float>(), which would cause boxing and trigger garbage collection.
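To illustrate, here is a sketch of what that anti-pattern might look like (this is a hypothetical benchmark I made up, not code from the examples above):

```kotlin
@Test
fun squareBoxed() {
    // Anti-pattern: the list keeps the multiplication alive, but every
    // Float appended is boxed into a java.lang.Float object, so the
    // measurement now includes allocations and GC pressure instead of
    // just the arithmetic we care about.
    val results = mutableListOf<Float>()
    benchmarkRule.measureRepeated {
        for (f in data) {
            results += f * f
        }
    }
}
```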
Thankfully, as of version 1.3.0 of the microbenchmark library, you can use the BlackHole API:
@Test
fun square() {
    benchmarkRule.measureRepeated {
        for (f in data) {
            BlackHole.consume(f * f)
        }
    }
}
The consume() function’s implementation is invisible to ART, which forces it to preserve the original code since it cannot assume there are no side effects. The function is implemented as a @CriticalNative JNI call that doesn’t do anything, so it’s really fast and safe to use in most (all?) cases.
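If you are stuck on an older version of the library without BlackHole, one possible workaround is to write the result into a @Volatile field: a volatile store is an observable effect, so the compiler cannot eliminate the computation that feeds it. This is a sketch under that assumption (the sink field is my own naming, not part of any API), and keep in mind the volatile store itself adds a small, constant cost to each iteration:

```kotlin
// Hypothetical fallback: a volatile store cannot be optimized away,
// so the multiplication feeding it must be kept.
@Volatile
var sink: Float = 0f

@Test
fun square() {
    benchmarkRule.measureRepeated {
        for (f in data) {
            sink = f * f
        }
    }
}
```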
Our benchmark can now finally reveal the truth:
// println()
11,755 ns 0 allocs Trace MathBenchmark.square
100,771 ns 0 allocs Trace MathBenchmark.pow2
// BlackHole
12,215 ns 0 allocs Trace MathBenchmark.square
102,351 ns 0 allocs Trace MathBenchmark.pow2
Using pow(2f) is ~9x slower than a simple multiplication (and I wish R8 or ART would turn calls to pow(2f) and pow(3f) into multiplications).
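In the meantime, nothing stops you from doing that rewrite by hand. A small pair of extension functions (the names squared() and cubed() are my own choice, not a standard API) keeps call sites readable while compiling down to plain multiplications:

```kotlin
// Hypothetical helpers: after inlining, these become one or two
// mul-float instructions instead of a call to Math.pow().
inline fun Float.squared(): Float = this * this
inline fun Float.cubed(): Float = this * this * this
```

With these, value.squared() reads almost as naturally as value.pow(2f) without the ~9x penalty measured above.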
Even with this simple example, we saw that it can be tricky to truly measure what we think we are measuring. The problem becomes a lot more complex once you need to take into account memory access patterns, branch prediction, etc.