In the previous post, we saw how we could micro-optimize Int.sign to save a few instructions. We are now going to turn to Float.sign (and by extension Double.sign).
Float.sign returns the sign of a single-precision float value as a single-precision float value. While similar to Int.sign, this API must handle a special case: Not-a-Number (NaN). The exact behavior of the API is that it will return:
- -1.0f if the value is negative
- +/-0.0f if the value is zero (floats can encode both positive and negative zero)
- 1.0f if the value is positive
- NaN if the value is NaN
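For instance, here is a quick sanity check of those rules, a minimal sketch using the standard library’s Float.sign (the values in the comments are what the rules above prescribe):

import kotlin.math.sign

fun main() {
    println((-2.5f).sign)   // -1.0
    println((-0.0f).sign)   // -0.0, negative zero is preserved
    println(0.0f.sign)      // 0.0
    println(42.0f.sign)     // 1.0
    println(Float.NaN.sign) // NaN
}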
An easy way to implement this API ourselves is to return the input when it equals 0.0f or is NaN, and to return the input’s sign copied onto 1.0f otherwise. Translated to code, we can write:
public inline val Float.sign: Float get() = if (this == 0.0f || isNaN()) {
    this
} else {
    1.0f.withSign(this)
}
On Android, Kotlin does not implement Float.sign as above, but delegates to java.lang.Math.signum instead:
public actual inline val Float.sign: Float get() = nativeMath.signum(this)
If we look at the implementation of signum()
on Android, we find:
public static float signum(float f) {
    return (f == 0.0f || Float.isNaN(f)) ? f : copySign(1.0f, f);
}
We can now turn to the generated aarch64 assembly to see what happens once the code runs on an actual device:
 1  fcmp s0, #0.0
 2  b.eq #+0x28 (addr 0x266c)
 3  fcmp s0, s0
 4  b.ne #+0x20 (addr 0x266c)
 5  fmov s1, #0x70 (1.0000)
 6  fmov w0, s0
 7  and w0, w0, #0x80000000
 8  fmov w1, s1
 9  and w1, w1, #0x7fffffff
10  orr w0, w0, w1
11  fmov s0, w0
12  ret
The first interesting thing we can notice is that both isNaN() and copySign() disappear as function calls and are instead replaced with their implementations, via a combination of inlining and intrinsics in ART.
The code is a pretty direct translation of the original Java source:
- Lines 1 and 2 check if the value is 0.0f
- Lines 3 and 4 check if the value is NaN
- And the rest implements copySign()
So… all good? I could describe the assembly step by step, but it will be easier for most readers if we look at the Java implementation of copySign() directly:
public static float copySign(float magnitude, float sign) {
    return Float.intBitsToFloat(
        (Float.floatToRawIntBits(sign) & (FloatConsts.SIGN_BIT_MASK)) |
        (Float.floatToRawIntBits(magnitude) & (FloatConsts.EXP_BIT_MASK | FloatConsts.SIGNIF_BIT_MASK))
    );
}
What looks like a lot of scary bit manipulation is actually fairly simple and relies on one fact: the most-significant bit (MSB) of a float (or double) encodes the sign of the number. When the MSB is set to 1, the number is negative; otherwise the number is positive.
This code also uses floatToRawIntBits()
to get the bit representation of a float as an integer.
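For example, here is a quick illustrative check of that fact using Kotlin’s equivalent, Float.toRawBits() (the toUInt() conversion is only there to print the bit pattern unsigned):

fun main() {
    // The sign bit (MSB) is set for negative values…
    println((-2.0f).toRawBits().toUInt().toString(16)) // prints "c0000000"
    // …and clear for positive values
    println(2.0f.toRawBits().toUInt().toString(16))    // prints "40000000"
}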
Given this information, the code should be easier to follow:
- First we mask the bit representation of the sign input with 0x80000000 to extract the sign bit
- Then we mask the bit representation of the magnitude input with 0x7fffffff to extract all the bits except the sign bit
- We combine both with a binary OR, as sketched below
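Here is a minimal Kotlin sketch of that recipe, writing the masks as literals instead of the FloatConsts constants (the copySignSketch name is mine, not a platform API):

fun copySignSketch(magnitude: Float, sign: Float): Float {
    val signBit = sign.toRawBits() and 0x80000000.toInt()  // keep only the sign bit of `sign`
    val valueBits = magnitude.toRawBits() and 0x7fffffff   // keep the exponent and significand of `magnitude`
    return Float.fromBits(signBit or valueBits)            // recombine them into a float
}

// copySignSketch(1.0f, -3.0f) == -1.0f, just like Math.copySign(1.0f, -3.0f)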
Go back to the aarch64 assembly above, and you’ll see that this is exactly what happens (hint: fmov is what performs both intBitsToFloat() and floatToRawIntBits() at the assembly level, and I will have something interesting to say about this in a future post). This tells us something interesting about copySign(): it is not an intrinsic, but it is inlined, and the functions it calls are themselves intrinsics. All function calls disappear.
So far so good, but if we look more closely at the assembly, we can notice something rather silly. The code spends a few instructions loading the constant 1.0f just to extract its non-sign bits (the exponent and significand):
fcmp s0, #0.0
b.eq #+0x28 (addr 0x266c)
fcmp s0, s0
b.ne #+0x20 (addr 0x266c)
fmov s1, #0x70 (1.0000)      // load the constant 1.0f into s1
fmov w0, s0
and w0, w0, #0x80000000      // extract the sign bit of the input
fmov w1, s1                  // move the bits of 1.0f into a general-purpose register…
and w1, w1, #0x7fffffff      // …only to mask out its sign bit at runtime
orr w0, w0, w1
fmov s0, w0
ret
But 1.0f is a known constant, and instead of extracting its bits at runtime we could just… use the hexadecimal representation of 1.0f, 0x3f800000.
This is something you would normally expect a compiler or an optimizer to do when executing a constant folding pass. Unfortunately, ART currently does not perform constant folding through intrinsics.
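If you want to double-check that value, a tiny Kotlin snippet does it (the toString(16) call is only there to print the bits in hexadecimal):

fun main() {
    // 1.0f is sign 0, biased exponent 127 (0x7f), zero significand
    println(1.0f.toRawBits().toString(16)) // prints "3f800000"
}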
So what can we do about it? We can rewrite Float.sign/signum() to bypass copySign()/withSign() and do the sign copy ourselves. Here’s a Kotlin version:
public inline val Float.sign: Float get() = if (this == 0.0f || isNaN()) {
    this
} else {
    Float.fromBits((toRawBits() and 0x80000000.toInt()) or 0x3f800000)
}
Doing this saves two instructions in the generated aarch64 code, and this optimization will be delivered in a future update of libcore, ART’s standard library:
fcmp s0, #0.0
b.eq #+0x20 (addr 0x26a4)
fcmp s0, s0
cset w0, ne                  // w0 = 1 when the comparison is unordered, i.e. the value is NaN
cbnz w0, #+0x14 (addr 0x26a4)
fmov w0, s0
and w0, w0, #0x80000000
orr w0, w0, #0x3f800000      // the bits of 1.0f, folded into an immediate
fmov s0, w0
ret
It is interesting to note that swapping implementations has the side effect of changing how the isNaN() check is performed. Instead of a comparison (fcmp) and a jump (b.ne), we now use a comparison followed by cset and cbnz. This is apparently caused by the different code generation paths taken in the two cases (inlining vs. not), and it means we could in theory save another instruction.
Update: Thanks to Pete Cawley’s suggestion, I tried a C++ implementation. The C++ version is just a straight port of the Java/Kotlin implementation:
#include <cmath>    // std::isnan
#include <cstdint>  // uint32_t
#include <cstring>  // std::memcpy

__attribute__((always_inline))
inline uint32_t to_uint32(float x) {
    uint32_t a;
    std::memcpy(&a, &x, sizeof(x));
    return a;
}

__attribute__((always_inline))
inline float to_float(uint32_t x) {
    float a;
    std::memcpy(&a, &x, sizeof(x));
    return a;
}

float sign(float x) {
    if (x == 0.0f || std::isnan(x)) {
        return x;
    } else {
        uint32_t d = to_uint32(x);
        return to_float((d & 0x80000000) | 0x3f800000);
    }
}
With this implementation, the compiler (clang 17.0) will produce the following aarch64 code:
fmov w8, s0
fcmp s0, #0.0
and w8, w8, #0x80000000
orr w8, w8, #0x3f800000
fmov s1, w8
fcsel s1, s0, s1, eq    // keep the input if it compared equal to 0.0f
fcsel s0, s0, s1, vs    // keep the input if the comparison was unordered (NaN)
ret
This new version saves another 2 instructions, for a total of 4 instructions (30%) compared to the original implementation (and it’s branchless!). This solution relies on the fact that fcmp sets the overflow flag (V) when either operand is NaN. Since we compare against 0.0f, which we know cannot be NaN, we can check the V flag to know whether the input is NaN. This is achieved above using the fcsel instruction and the vs condition. This means that unless ART could perform the same optimization automatically, it might be worth implementing Math.signum as an intrinsic.
Next time, we’ll take a look at one of the following topics:
- floatToRawIntBits() and a not-so-micro micro-optimization
- Optimizing value classes
- Using a better HashMap
- Faster range-checks
- Optimizing code size with de-inlining
- Optimizing a text parser