Optimizing code can be a difficult task because there are so many traps you need to avoid at every step of the way. Today I want to focus on one of the (numerous) benchmarking traps, which you may have run into, and that I myself encounter regularly.
Let’s imagine you are trying to optimize code, and you notice the use of value.pow(2f). One obvious way to optimize this is to replace the function call with a multiplication (value * value), but is it worth it? Since you are a diligent engineer, you decide to write a microbenchmark to compare before and after:
@RunWith(AndroidJUnit4::class)
class MathBenchmark {
    @get:Rule
    val benchmarkRule = BenchmarkRule()

    val data = FloatArray(8_192) {
        it.toFloat() / 3f
    }

    @Test
    fun pow2() {
        benchmarkRule.measureRepeated {
            for (f in data) {
                f.pow(2f)
            }
        }
    }

    @Test
    fun square() {
        benchmarkRule.measureRepeated {
            for (f in data) {
                f * f
            }
        }
    }
}
You then run your benchmark on a Pixel 6 and get the following results:
11,673 ns 0 allocs Trace MathBenchmark.square
11,681 ns 0 allocs Trace MathBenchmark.pow2
Problem solved, pow(2f) is just as fast as a multiplication, let’s move on… But wait, are we measuring what we think we are measuring? Even if pow(2f) is implemented efficiently, I would expect it to use at least a few more instructions than a single multiplication, and the difference should be visible in our measurements.
Let’s take a look at the compiled dex code:
void MathBenchmark.pow2()
  -- 12 instructions
  0000: iget-object v0, v8, LExampleBenchmark;.data:[F // field@0000
  0002: array-length v1, v0
  0003: const/4 v2, #int 0 // #0
  0004: if-ge v2, v1, 0012 // +000e
  0006: aget v3, v0, v2
  0008: float-to-double v4, v3
  0009: const/high16 v6, #int 1073741824 // #4000
  000b: float-to-double v6, v6
  000c: invoke-static {v4, v5, v6, v7}, Ljava/lang/Math;.pow:(DD)D // method@0004
  000f: add-int/lit8 v2, v2, #int 1 // #01
  0011: goto 0004 // -000d
  0012: return-void

void MathBenchmark.square()
  -- 9 instructions
  0000: iget-object v0, v4, LExampleBenchmark;.data:[F // field@0000
  0002: array-length v1, v0
  0003: const/4 v2, #int 0 // #0
  0004: if-ge v2, v1, 000c // +0008
  0006: aget v3, v0, v2
  0008: nop // spacer
  0009: add-int/lit8 v2, v2, #int 1 // #01
  000b: goto 0004 // -0007
  000c: return-void
The pow(2f) case makes a function call, whereas the other method doesn’t do anything. There is simply an empty loop; our multiplication is gone. The compiler optimized the operation away since it has no side effect and is not necessary for the program. It kept the call to pow() because a function call that’s not inlined can have unknown side effects.
We have two problems now:
- Why are both implementations just as fast if one doesn’t do anything?
- Can we bring back our multiplication?
Let’s start with the second problem so we can re-run our measurements on equal footing. We need to find a way to prevent the compiler from throwing away our multiplication. A simple solution is to do this:
@Test
fun pow2() {
    var result = 0f
    benchmarkRule.measureRepeated {
        for (f in data) {
            result += f.pow(2f)
        }
    }
}

@Test
fun square() {
    var result = 0f
    benchmarkRule.measureRepeated {
        for (f in data) {
            result += f * f
        }
    }
}
The compiled dex code now contains our multiplication:
void MathBenchmark.pow2()
  -- 16 instructions
  0000: const/4 v0, #int 0 // #0
  0001: iget-object v1, v9, LExampleBenchmark;.data:[F // field@0000
  0003: array-length v2, v1
  0004: const/4 v3, #int 0 // #0
  0005: if-ge v3, v2, 0016 // +0011
  0007: aget v4, v1, v3
  0009: float-to-double v5, v4
  000a: const/high16 v7, #int 1073741824 // #4000
  000c: float-to-double v7, v7
  000d: invoke-static {v5, v6, v7, v8}, Ljava/lang/Math;.pow:(DD)D // method@0004
  0010: move-result-wide v5
  0011: double-to-float v5, v5
  0012: add-float/2addr v0, v5
  0013: add-int/lit8 v3, v3, #int 1 // #01
  0015: goto 0005 // -0010
  0016: return-void

void MathBenchmark.square()
  -- 11 instructions
  0000: const/4 v0, #int 0 // #0
  0001: iget-object v1, v6, LExampleBenchmark;.data:[F // field@0000
  0003: array-length v2, v1
  0004: const/4 v3, #int 0 // #0
  0005: if-ge v3, v2, 000f // +000a
  0007: aget v4, v1, v3
  0009: mul-float v5, v4, v4
  000b: add-float/2addr v0, v5
  000c: add-int/lit8 v3, v3, #int 1 // #01
  000e: goto 0005 // -0009
  000f: return-void
We are now doing a proper comparison, so let’s run the benchmark again:
11,728 ns 0 allocs Trace MathBenchmark.square
11,729 ns 0 allocs Trace MathBenchmark.pow2
Same result. So is pow(2f) really as fast as a multiplication? Does ART have a magic intrinsic that turns the function call into a single multiplication? Let’s go a level deeper and inspect the aarch64 assembly generated by ART:
void MathBenchmark.pow2()
  -- 7 instructions (28 bytes)
  0x000010b0: str x0, [sp, #-32]!
  0x000010b4: str lr, [sp, #24]
  0x000010b8: ldr w0, [x1, #8]
  0x000010bc: ldr wzr, [x0]
  0x000010c0: ldr lr, [sp, #24]
  0x000010c4: add sp, sp, #0x20 (32)
  0x000010c8: ret

void MathBenchmark.square()
  -- 5 instructions (20 bytes)
  0x000010d0: stp x0, lr, [sp, #-16]!
  0x000010d4: ldr w0, [x1, #8]
  0x000010d8: ldr wzr, [x0]
  0x000010dc: ldp xzr, lr, [sp], #16
  0x000010e0: ret
All of our code is gone! ART is able to see that even with an accumulator, the work we are doing has no side effect, so it decided to get rid of it, and it can do this for the pow(2f) case as well.
To properly measure what we are trying to measure, we need to ensure a side effect. An easy way to do this, for instance, is to print our accumulator outside of the measurement loop:
@Test
fun square() {
    var result = 0f
    benchmarkRule.measureRepeated {
        for (f in data) {
            result += f * f
        }
    }
    println(result)
}
The trick with injecting side effects is to find a way to do so without affecting the benchmark itself, at least not too much. For instance, it would be a bad idea to accumulate the results in a mutableListOf<Float>(), which would cause boxing and trigger garbage collection.
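To illustrate, here is a sketch of what that anti-pattern might look like (this is a hypothetical benchmark I made up, not code from the examples above):

```kotlin
@Test
fun squareBoxed() {
    // Anti-pattern: the list keeps the multiplication alive, but every
    // Float appended is boxed into a java.lang.Float object, so the
    // measurement now includes allocations and GC pressure instead of
    // just the arithmetic we care about.
    val results = mutableListOf<Float>()
    benchmarkRule.measureRepeated {
        for (f in data) {
            results += f * f
        }
    }
}
```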
Thankfully, as of version 1.3.0 of the microbenchmark library, you can use the BlackHole API:
@Test
fun square() {
    benchmarkRule.measureRepeated {
        for (f in data) {
            BlackHole.consume(f * f)
        }
    }
}
The consume() function’s implementation is invisible to ART, which forces it to preserve the original code since it cannot assume there are no side effects. The function is implemented as a @CriticalNative JNI call that doesn’t do anything, so it’s really fast and safe to use in most (all?) cases.
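If you are stuck on an older version of the library without BlackHole, one possible workaround is to write the result into a @Volatile field: a volatile store is an observable effect, so the compiler cannot eliminate the computation that feeds it. This is a sketch under that assumption (the sink field is my own naming, not part of any API), and keep in mind the volatile store itself adds a small, constant cost to each iteration:

```kotlin
// Hypothetical fallback: a volatile store cannot be optimized away,
// so the multiplication feeding it must be kept.
@Volatile
var sink: Float = 0f

@Test
fun square() {
    benchmarkRule.measureRepeated {
        for (f in data) {
            sink = f * f
        }
    }
}
```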
Our benchmark can now finally reveal the truth:
// println()
11,755 ns 0 allocs Trace MathBenchmark.square
100,771 ns 0 allocs Trace MathBenchmark.pow2
// BlackHole
12,215 ns 0 allocs Trace MathBenchmark.square
102,351 ns 0 allocs Trace MathBenchmark.pow2
Using pow(2f) is ~9x slower than a simple multiplication (and I wish R8 or ART would turn calls to pow(2f) and pow(3f) into multiplications).
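In the meantime, nothing stops you from doing that rewrite by hand. A small pair of extension functions (the names squared() and cubed() are my own choice, not a standard API) keeps call sites readable while compiling down to plain multiplications:

```kotlin
// Hypothetical helpers: after inlining, these become one or two
// mul-float instructions instead of a call to Math.pow().
inline fun Float.squared(): Float = this * this
inline fun Float.cubed(): Float = this * this * this
```

With these, value.squared() reads almost as naturally as value.pow(2f) without the ~9x penalty measured above.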
Even with this simple example, we saw that it can be tricky to truly measure what we think we are measuring. The problem becomes a lot more complex once you need to take into account memory access patterns, branch prediction, etc.