While my work responsibilities do not leave me much time to write code nowadays, I have managed to make a few small contributions to Jetpack Compose in the last few months, mostly focusing on performance.
If you are an Android app developer, your performance concerns probably start and stop at a fairly high level [1]. I find working on large-scale libraries like Compose fascinating because you need to worry about performance not only at a macro level, but also at a micro level. Since parts of the libraries can be invoked frequently (many times per frame, for instance), even micro-optimizations can make a difference [2].
Jetpack Compose obviously benefits greatly from the amazing work done by kotlinc, R8, and ART to automagically optimize both your apps and our libraries, but these automatic optimizations have their own limits. Importantly, some of those optimizations (the ones performed by R8) will not apply in debug mode. This means that there are optimizations that will matter to developers if they allow us to improve their debugging workflow.

With that in mind, I have spent a lot of time looking at the code in Jetpack Compose to find optimization opportunities at all levels of the stack. To help me do this, I have built and published kotlin-explorer, a desktop app that makes it easier to visualize Kotlin code as both dex bytecode and ARM 64-bit assembly. Using this tool revealed a few fascinating low-level optimization opportunities I would like to share with you, starting with Int.sign. Next time, we'll look at Float.sign.
Int.sign is a simple API that returns the sign of an integer as an integer:

-1 if the value is negative
0 if the value is zero
1 if the value is positive
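For instance, a quick check in plain Kotlin:

import kotlin.math.sign

fun main() {
    println((-42).sign) // prints -1
    println(0.sign)     // prints 0
    println(7.sign)     // prints 1
}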
In Kotlin, Int.sign is implemented as follows in the standard library:
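Roughly speaking (this is a sketch rather than the verbatim stdlib source, but it matches the dex bytecode shown below):

// Sketch of the standard library implementation
public actual val Int.sign: Int get() = when {
    this < 0 -> -1 // negative values map to -1
    this > 0 -> 1  // positive values map to 1
    else -> 0      // zero maps to 0
}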
The implementation is clean, concise and does exactly what it should, so what could we possibly improve here? To figure this out, let’s look at the generated dex bytecode:
if-gez v0, 0004 // +0004
const/4 v0, #int -1 // #ff
goto 0009 // +0006
if-lez v0, 0008 // +0004
const/4 v0, #int 1 // #1
goto 0009 // +0002
const/4 v0, #int 0 // #0
return v0
This bytecode is a direct translation of the original Kotlin code into dex instructions, so let's go a level deeper and look at the aarch64 assembly that will run on your Android device:
cmp w1, #0x0 (0)      // compare the input to 0
cset w0, ge           // w0 = 1 if the input is >= 0, 0 otherwise
cmp w1, #0x0 (0)      // compare the input to 0 a second time
cset w1, gt           // w1 = 1 if the input is > 0, 0 otherwise
cmp w0, #0x0 (0)      // test the ">= 0" flag
csinv w0, w1, wzr, ne // if the flag is set, w0 = w1, otherwise ~wzr, i.e. -1
ret
This version is a little better because it removes the branches found in the original code. It relies instead on aarch64's conditional but branchless instructions cset and csinv. Even if you don't fully understand aarch64 assembly, the fact that the comparison instruction cmp is used twice to compare the w1 register to 0 should raise questions. And it is indeed possible to write a more optimized version of Int.sign in aarch64:
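Something like the following, a hand-written sketch rather than captured compiler output:

cmp w1, #0x0          // a single comparison of the input against 0
cset w0, gt           // w0 = 1 if the input is positive, 0 otherwise
csinv w0, w0, wzr, ge // keep w0 if the input is >= 0, otherwise ~wzr, i.e. -1
ret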
This new version uses half of the instructions (excluding ret) compared to the previous solution. Thankfully, it is easy to get Int.sign to produce this code by forcing it to use java.lang.Integer.signum(), for which ART provides an optimized intrinsic:
public actual val Int.sign: Int get() = Integer.signum(this)
So what should you do about this? You could create your own version of Int.sign (see below as well), or you could wait for Kotlin 2.0, which will include a fix. JetBrains measured the impact of this change on JDK 21 on Linux, and the improvements are significant.
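For instance, a minimal sketch of a hand-rolled replacement (fastSign is a hypothetical name, chosen here to avoid clashing with the stdlib extension):

// Hypothetical extension that forwards to the ART-intrinsified Integer.signum()
val Int.fastSign: Int get() = Integer.signum(this)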
There is another way to implement Int.sign without branches that doesn't rely on runtime intrinsics to produce good aarch64 assembly:
val Int.sign: Int get() = (this shr 31) or (-this ushr 31)
With some bit-manipulation trickery [3] we end up with the following aarch64 code:
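Something along these lines (again a hand-written sketch rather than captured compiler output):

asr w0, w1, #31         // arithmetic shift: -1 for negative inputs, 0 otherwise
neg w2, w1              // negate the input
orr w0, w0, w2, lsr #31 // or in the top bit of -input, i.e. 1 for positive values
ret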
Does this version matter? No idea, I have not benchmarked it [4]. But it's neat.
[1] As they should.
[2] Especially when you add up the effects of many such micro-optimizations.
[3] I love Kotlin but I strongly dislike its bitwise operators, especially to shift bits around.
[4] But JetBrains did and it looks to be as fast as the signum() version.