In the previous post, we saw how we could micro-optimize Int.sign to save a few instructions. We are now going to turn to Float.sign (and by extension Double.sign).
Float.sign returns the sign of a single-precision float value as a single-precision float value. While similar to Int.sign, this API must handle a special case: Not-a-Number (NaN). The exact behavior of the API is that it will return:
- -1.0f if the value is negative
- +/-0.0f if the value is zero (floats can encode both positive and negative zero)
- 1.0f if the value is positive
- NaN if the value is NaN
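For instance, here is a quick sanity check of those rules, a minimal sketch using the standard library’s Float.sign (the values in the comments are what the rules above prescribe):

import kotlin.math.sign

fun main() {
    println((-2.5f).sign)   // -1.0
    println((-0.0f).sign)   // -0.0, negative zero is preserved
    println(0.0f.sign)      // 0.0
    println(42.0f.sign)     // 1.0
    println(Float.NaN.sign) // NaN
}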
An easy way to implement this API ourselves is to return the input when it equals 0.0f or is NaN, and to return the input’s sign copied onto 1.0f otherwise. Translated to code, we can write:
public inline val Float.sign: Float get() = if (this == 0.0f || isNaN()) {
    this
} else {
    1.0f.withSign(this)
}
On Android, Kotlin does not implement Float.sign as above, but delegates to java.lang.Math.signum instead:
public actual inline val Float.sign: Float get() = nativeMath.signum(this)
If we look at the implementation of signum()
on Android, we find:
public static float signum(float f) {
    return (f == 0.0f || Float.isNaN(f)) ? f : copySign(1.0f, f);
}
We can now turn to the generated aarch64 assembly to see what happens once the code runs on an actual device:
 1  fcmp s0, #0.0
 2  b.eq #+0x28 (addr 0x266c)
 3  fcmp s0, s0
 4  b.ne #+0x20 (addr 0x266c)
 5  fmov s1, #0x70 (1.0000)
 6  fmov w0, s0
 7  and w0, w0, #0x80000000
 8  fmov w1, s1
 9  and w1, w1, #0x7fffffff
10  orr w0, w0, w1
11  fmov s0, w0
12  ret
The first interesting thing we can notice is that both isNaN() and copySign() disappear as function calls and are instead replaced with their implementations, via a combination of inlining and intrinsics in ART.
The code is a pretty direct translation of the original Java source:
- Lines 1 and 2 check if the value is 0.0f
- Lines 3 and 4 check if the value is NaN
- And the rest implements copySign()
So… all good? I could describe the assembly step by step, but it will be easier for most readers if we look at the Java implementation of copySign() directly:
public static float copySign(float magnitude, float sign) {
    return Float.intBitsToFloat(
        (Float.floatToRawIntBits(sign) & (FloatConsts.SIGN_BIT_MASK)) |
        (Float.floatToRawIntBits(magnitude) & (FloatConsts.EXP_BIT_MASK | FloatConsts.SIGNIF_BIT_MASK))
    );
}
What looks like a lot of scary bit manipulation is actually fairly simple and relies on one fact: the most-significant bit (MSB) of a float (or double) encodes the sign of the number. When the MSB is set to 1, the number is negative; otherwise the number is positive.
This code also uses floatToRawIntBits()
to get the bit representation of a float as an integer.
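For example, here is a quick illustrative check of that fact using Kotlin’s equivalent, Float.toRawBits() (the toUInt() conversion is only there to print the bit pattern unsigned):

fun main() {
    // The sign bit (MSB) is set for negative values…
    println((-2.0f).toRawBits().toUInt().toString(16)) // prints "c0000000"
    // …and clear for positive values
    println(2.0f.toRawBits().toUInt().toString(16))    // prints "40000000"
}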
Given this information, the code should be easier to follow:
- First we mask the bit representation of the sign input with 0x80000000 to extract the sign bit
- Then we mask the bit representation of the magnitude input with 0x7fffffff to extract all the bits except the sign bit
- We combine both with a binary OR, as sketched below
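Here is a minimal Kotlin sketch of that recipe, writing the masks as literals instead of the FloatConsts constants (the copySignSketch name is mine, not a platform API):

fun copySignSketch(magnitude: Float, sign: Float): Float {
    val signBit = sign.toRawBits() and 0x80000000.toInt()  // keep only the sign bit of `sign`
    val valueBits = magnitude.toRawBits() and 0x7fffffff   // keep the exponent and significand of `magnitude`
    return Float.fromBits(signBit or valueBits)            // recombine them into a float
}

// copySignSketch(1.0f, -3.0f) == -1.0f, just like Math.copySign(1.0f, -3.0f)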
Go back to the aarch64 assembly above, and you’ll see that this is exactly what happens (hint: fmov is what performs both intBitsToFloat() and floatToRawIntBits() at the assembly level, and I will have something interesting to say about this in a future post). This tells us something interesting about copySign(): it is not an intrinsic, but it is inlined, and the functions it calls are themselves intrinsics. All function calls disappear.
So far so good, but if we look more closely at the assembly, we can notice something rather silly. The code spends a few instructions loading the constant 1.0f just to extract its non-sign bits (the exponent and significand):
fcmp s0, #0.0
b.eq #+0x28 (addr 0x266c)
fcmp s0, s0
b.ne #+0x20 (addr 0x266c)
fmov s1, #0x70 (1.0000)      // load the constant 1.0f into s1
fmov w0, s0
and w0, w0, #0x80000000      // extract the sign bit of the input
fmov w1, s1                  // move the bits of 1.0f into a general-purpose register…
and w1, w1, #0x7fffffff      // …only to mask out its sign bit at runtime
orr w0, w0, w1
fmov s0, w0
ret
But 1.0f is a known constant, and instead of extracting its bits at runtime we could just… use the hexadecimal representation of 1.0f, 0x3f800000.
This is something you would normally expect a compiler or an optimizer to do when executing a constant folding pass. Unfortunately, ART currently does not perform constant folding through intrinsics.
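If you want to double-check that value, a tiny Kotlin snippet does it (the toString(16) call is only there to print the bits in hexadecimal):

fun main() {
    // 1.0f is sign 0, biased exponent 127 (0x7f), zero significand
    println(1.0f.toRawBits().toString(16)) // prints "3f800000"
}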
So what can we do about it? We can rewrite Float.sign/signum() to bypass copySign()/withSign() and do the sign copy ourselves. Here’s a Kotlin version:
public inline val Float.sign: Float get() = if (this == 0.0f || isNaN()) {
    this
} else {
    Float.fromBits((toRawBits() and 0x80000000.toInt()) or 0x3f800000)
}
Doing this saves two instructions in the generated aarch64 code, and this optimization will be delivered in a future update of libcore, ART’s standard library:
fcmp s0, #0.0
b.eq #+0x20 (addr 0x26a4)
fcmp s0, s0
cset w0, ne                  // w0 = 1 when the comparison is unordered, i.e. the value is NaN
cbnz w0, #+0x14 (addr 0x26a4)
fmov w0, s0
and w0, w0, #0x80000000
orr w0, w0, #0x3f800000      // the bits of 1.0f, folded into an immediate
fmov s0, w0
ret
It is interesting to note that swapping implementations has the side effect of changing how the isNaN() check is performed. Instead of a comparison (fcmp) and a jump (b.ne), we now use a comparison followed by cset and cbnz. This is apparently caused by the different code generation paths taken in the two cases (inlining vs. not), and it means we could in theory save another instruction.
Update: Thanks to Pete Cawley’s suggestion, I tried a C++ implementation. The C++ version is just a straight port of the Java/Kotlin implementation:
#include <cmath>    // std::isnan
#include <cstdint>  // uint32_t
#include <cstring>  // std::memcpy

__attribute__((always_inline))
inline uint32_t to_uint32(float x) {
    uint32_t a;
    std::memcpy(&a, &x, sizeof(x));
    return a;
}

__attribute__((always_inline))
inline float to_float(uint32_t x) {
    float a;
    std::memcpy(&a, &x, sizeof(x));
    return a;
}

float sign(float x) {
    if (x == 0.0f || std::isnan(x)) {
        return x;
    } else {
        uint32_t d = to_uint32(x);
        return to_float((d & 0x80000000) | 0x3f800000);
    }
}
With this implementation, the compiler (clang 17.0) will produce the following aarch64 code:
fmov w8, s0
fcmp s0, #0.0
and w8, w8, #0x80000000
orr w8, w8, #0x3f800000
fmov s1, w8
fcsel s1, s0, s1, eq    // keep the input if it compared equal to 0.0f
fcsel s0, s0, s1, vs    // keep the input if the comparison was unordered (NaN)
ret
This new version saves another 2 instructions, for a total of 4 instructions (30%) compared to the original implementation (and it’s branchless!). This solution relies on the fact that fcmp sets the overflow flag (V) when either operand is NaN. Since we compare against 0.0f, which we know cannot be NaN, we can check the V flag to know whether the input is NaN. This is achieved above using the fcsel instruction and the vs condition. This means that unless ART could perform the same optimization automatically, it might be worth implementing Math.signum as an intrinsic.
Next time, we’ll take a look at one of the following topics:
- floatToRawIntBits() and a not-so-micro micro-optimization
- Optimizing value classes
- Using a better HashMap
- Faster range-checks
- Optimizing code size with de-inlining
- Optimizing a text parser