In the last post, we saw that benchmarks don’t always measure what we think they measure. Let’s look at another instance of this problem today, starting with this rather simple benchmark:

@RunWith(AndroidJUnit4::class)
class DataBenchmark {
    @get:Rule
    val benchmarkRule = BenchmarkRule()

    // Generate data in [0..255]
    private val data = IntArray(65_536) {
        it % 256
    }

    @Test
    fun processData() {
        var sum = 0f
        benchmarkRule.measureRepeated {
            for (d in data) {
                if (d < 128) {
                    sum += d / 128f
                }
            }
        }
        BlackHole.consume(sum)
    }
}

This benchmark tests an algorithm that processes an array of values between 0 and 255 (8-bit pixels for instance), and computes the sum of the normalized values for pixels less than 128.
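Stripped of the benchmark harness, the measured work boils down to the loop below (a standalone Java sketch I added so the computation can be run on any JVM; the class name, main method, and expected value are mine, not part of the original benchmark):

```java
import java.util.stream.IntStream;

public class ProcessData {
    // Same work as the benchmark body: sum of d / 128f for every d < 128.
    static float processData(int[] data) {
        float sum = 0f;
        for (int d : data) {
            if (d < 128) {
                sum += d / 128f;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Same data as the benchmark: the values 0..255 repeated across 65,536 entries.
        int[] data = IntStream.range(0, 65_536).map(i -> i % 256).toArray();
        // Each 256-value block contributes (0 + 1 + ... + 127) / 128 = 63.5,
        // and there are 256 blocks, so the total is 256 * 63.5 = 16256.
        // Every partial sum is an exact multiple of 1/128 below 2^14, so the
        // float accumulation is exact.
        System.out.println(processData(data)); // prints 16256.0
    }
}
```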

Running this benchmark on a Pixel 6 on Android 15 gives the following result:

186,432 ns    DataBenchmark.processData

This in itself doesn’t tell us much, but what if I told you you can make this benchmark ~2.3x slower without changing the processData() function? The key is in how we have declared our test data array:

private val data = IntArray(65_536) {
    it % 256
}

As declared, the array contains the series of values 0 to 255, repeated multiple times. Let’s change this declaration to make the data random:

private val data = IntArray(65_536) {
    Random.nextInt(256)
}

If we run the benchmark again on the same device, we get this result:

426,655 ns    DataBenchmark.processData

The first version is so much faster because of branch prediction. Since we iterate over the data in order from 0 to 255 and execute the test if (d < 128), the branch predictor guesses correctly almost every time (except when crossing from 127 to 128, and from 255 back to 0). With random data, the predictor is right only ~50% of the time, causing a large number of branch misses, and therefore costly pipeline flushes.
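To build intuition for why ordered data is so predictable, here is a toy software model of a 2-bit saturating-counter branch predictor, run over both data layouts. This is a deliberate simplification (a standalone Java sketch; the predictor model, the fixed seed, and the thresholds are my assumptions, not the Pixel 6's actual predictor), but it shows the same shape of behavior:

```java
import java.util.Random;

public class PredictorModel {
    // Toy 2-bit saturating counter: states 0..3, predicts "taken" when state >= 2.
    static int mispredictions(int[] data) {
        int state = 2;
        int misses = 0;
        for (int d : data) {
            boolean taken = d < 128;          // the branch from the benchmark
            if (taken != (state >= 2)) misses++;
            state = taken ? Math.min(3, state + 1) : Math.max(0, state - 1);
        }
        return misses;
    }

    public static void main(String[] args) {
        int n = 65_536;
        Random rng = new Random(42);          // fixed seed for reproducibility
        int[] ordered = new int[n];
        int[] random = new int[n];
        for (int i = 0; i < n; i++) {
            ordered[i] = i % 256;
            random[i] = rng.nextInt(256);
        }
        // Ordered data only mispredicts around the 127->128 and 255->0
        // transitions; random data mispredicts roughly half the time.
        System.out.println("ordered: " + mispredictions(ordered));
        System.out.println("random:  " + mispredictions(random));
    }
}
```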

We can compare the number of branch misses in the two benchmarks to observe this behavior:

processDataInOrder
BranchMisses      min     256.1,   median     256.1,   max     256.9

processDataRandomized
BranchMisses      min  29,280.5,   median  29,388.5,   max  29,460.3

That’s 114x more branch misses. Oops.

Tip
Be careful when writing benchmarks and make sure the data you process resembles what your application will use.

Speeding up randomized access

Now that we are measuring “real-world” data (unless you are only processing gradients, you won’t find perfectly ordered sequences of pixels), we can improve our algorithm and make it branchless at the aarch64 level:

@Test
fun processDataBranchless() {
    var sum = 0f
    benchmarkRule.measureRepeated {
        for (d in dataRandomized) {
            val f = if (d < 128) d else 0
            sum += f / 128f
        }
    }
    BlackHole.consume(sum)
}

By moving the addition out of the condition, we change the branch so that it only produces a value: either the original value, or 0 (we could use a min() as well, but it’s interesting to directly observe the behavior of an if here). This in turn allows ART to replace the branch with a conditional select instruction:

cmp   w4, #0x80 (128)
; before
b.gt  #+0x10
; after
csel  w4, w4, wzr, le

And if the compiler is not smart enough to do it itself, you can always resort to bitwise trickery to achieve the same result:

val selector = (d - 128) shr 31
sum += (d and selector) / 128f
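As a quick sanity check, we can verify that the selector trick produces exactly the same value as the branch for every possible input. This is a standalone Java sketch I added (Java’s `>>` is an arithmetic shift, matching Kotlin’s shr, and the class name is mine):

```java
public class SelectorCheck {
    public static void main(String[] args) {
        for (int d = 0; d < 256; d++) {
            // (d - 128) >> 31 is all ones when d < 128, and zero otherwise,
            // because the arithmetic shift replicates the sign bit.
            int selector = (d - 128) >> 31;
            int branchless = d & selector;
            int branchy = (d < 128) ? d : 0;
            if (branchless != branchy) {
                throw new AssertionError("mismatch at d=" + d);
            }
        }
        System.out.println("selector trick matches the branch for all d in [0..255]");
    }
}
```

Note that the trick relies on arithmetic shift semantics and on d - 128 not overflowing, which is part of why it only works in specific cases.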

This workaround is not necessary on Android and only works in specific cases, so stick with a plain if when you can. Either way, the benchmark processing randomized data is now just as fast as the original linear data processor (slightly faster, actually, since it has fewer branch misses):

processDataInOrder
timeNs            min 186,259.1,   median 186,511.5,   max 203,527.1
BranchMisses      min     256.1,   median     256.2,   max     264.8

processDataRandomized
timeNs            min 421,986.9,   median 424,896.4,   max 436,348.3
BranchMisses      min  29,269.4,   median  29,381.5,   max  30,227.5

processDataBranchless
timeNs            min 186,126.9,   median 186,360.4,   max 190,971.6
BranchMisses      min       1.0,   median       1.1,   max       2.6

processDataBranchlessBitwise
timeNs            min 186,168.5,   median 186,348.8,   max 186,750.6
BranchMisses      min       1.0,   median       2.0,   max       2.4

Other subtle behaviors can impact your benchmarks, but that’s a topic for a future post.