Leland and I were recently discussing how to best implement
a new data structure to speed up a specific aspect of Jetpack Compose. He came up with a great
idea, and nerd sniped me in the process. The problem was to efficiently
encode the occupancy of an 8x8 grid represented as a Long
(each bit representing a cell in the
grid).
After coming up with the bit twiddling code that quickly “rasterizes” a rectangle into the grid as a bitfield, I found myself thinking about how incredibly helpful Kotlin can be at making low-level/micro-optimized code easy to read1 for users of an API.
The grid Link to heading
To understand what the code that follows does, here is an example of a grid initialized with a rectangle spanning from the coordinate (1,1) to (5,5). Note that the left/top coordinates are inclusive while right/bottom are exclusive.
Gives the following output:
00000000
01111000
01111000
01111000
01111000
00000000
00000000
00000000
Like mentioned earlier, we want to encode the grid in a Long
and Kotlin offers a great way
to do so while maintaining type safety and a nice API surface thanks to value classes:
1inline fun Grid() = Grid(0L)
2inline fun Grid(r: Rect) = Grid(rasterize(r.l, r.t, r.r, r.b))
3inline fun Grid(l: Int, t: Int, r: Int, b: Int) = Grid(rasterize(l, t, r, b))
4
5@JvmInline
6value class Grid @PublishedApi internal constructor(@PublishedApi internal val cells: Long) {
7 override fun toString() = buildString {
8 for (y in 0..7) {
9 val line = (cells ushr (56 - y shl 3) and 0xffL).toString(2).padStart(8, '0')
10 appendLine(line)
11 }
12 }
13}
14
15@PublishedApi
16internal fun rasterize(l: Int, t: Int, r: Int, b: Int): Long {
17 val w = r - l
18 val h = b - t
19 val scanline = 0xffL ushr (8 - w) shl (8 - r)
20 val rows = 0x01_01_01_01_01_01_01_01L ushr ((8 - h) shl 3) shl ((8 - b) shl 3)
21 return rows * scanline
22}
Let’s ignore the definition of Rect
for now, just know that it’s a type that exposes 4 properties
call l
, t
, r
, and b
for left, top, right, and bottom respectively. If we were to ship this
code as a public API we’d change the properties to have longer names, but we’ll keep the single
letter versions to keep our code concise here.
With this code, we can now construct grid “instances” and benefit from the aforementioned type safety without paying the cost of any abstraction.
We actually pay so little that building a grid with a constant rectangle is effectively free. The following Kotlin function:
1fun defaultGrid() = Grid(1, 1, 5, 5)
Yields the following DEX bytecode:
While this code was compiled with R82, which means extra optimizations were applied, the result
is quite impressive given how rasterize()
is implemented.
Adding operators Link to heading
With a working value type, we can now add a series of APIs to simplify grid manipulations. The ability to overload operators proves to be particularly useful in a case like this. With these operators we can now perform interesting operations on the grid, such as adding rectangles to produce the union:
1var grid = Grid(1, 1, 5, 5)
2grid += Rect(3, 3, 7, 7)
3println(grid)
4
5// Output:
6// 00000000
7// 01111000
8// 01111000
9// 01111110
10// 01111110
11// 00011110
12// 00011110
13// 00000000
Or subtracting a rectangle:
1var grid = Grid(1, 1, 5, 5)
2grid -= Rect(2, 2, 6, 6)
3println(grid)
4
5// Output:
6// 00000000
7// 01111000
8// 01000000
9// 01000000
10// 01000000
11// 00000000
12// 00000000
13// 00000000
Or querying if a cell is occupied:
These operators are fairly trivial to implement, and more importantly extremely efficient once compiled:
1inline operator fun get(x: Int, y: Int) =
2 ((cells ushr ((7 - y) shl 3)) and (0x1L shl (7 - x))) != 0L
3
4inline operator fun plus(r: Rect) =
5 Grid(cells or rasterize(r.l, r.t, r.r, r.b))
6
7inline operator fun minus(r: Rect) =
8 Grid(cells and rasterize(r.l, r.t, r.r, r.b).inv())
9
10inline infix fun and(r: Rect) =
11 Grid(cells and rasterize(r.l, r.t, r.r, r.b))
12
13inline fun intersects(r: Rect) =
14 (cells and rasterize(r.l, r.t, r.r, r.b)) != 0L
We could keep going and add other operators such as xor
or provide functions that operate
on two grids (the union of two grids for instance would just be a plus
operator that or
s
the cells
value of both grids).
Higher level APIs Link to heading
A natural thing we may want to do with our grid is iterate over all occupied cells. We can easily
add this capability with a forEach
function that takes a lambda:
1inline fun forEach(block: (Int, Int) -> Unit) {
2 var v = cells
3 while (v.hasNext()) {
4 val index = v.get()
5 block(index and 0x7, index ushr 3)
6 v = v.next()
7 }
8}
9
10@PublishedApi
11internal inline fun Long.get() = 63 - countTrailingZeroBits()
12
13@PublishedApi
14internal inline fun Long.hasNext() = this != 0L
15
16@PublishedApi
17internal inline fun Long.next() = this and (this - 1L)
The resulting aarch64 code is nice and tidy since countTrailingZeroBits()
is an ART intrinsic
that compiles down to rbit
+ clz
. Also note how we use extension functions to make our own
code easier to read at no extra cost.
Hiding the madness Link to heading
With all of the above, the users of our API can be completely oblivious to what’s going on in our code and they can happily write highly readable code such as:
1var grid = Grid(1, 1, 5, 5)
2grid += Rect(3, 3, 7, 7)
3grid = grid and Rect(5, 5, 8, 8)
4
5grid.forEach { x, y ->
6 process(x, y)
7}
All of our abstractions will just go away and leave behind only highly optimized code. This
forEach
loop for instance:
Will produce the following aarch64 code:
1 mov w2, #0x3f
2 mov w0, #0x0
3 cmp x1, #0x0 (0)
4 cset w3, ne
5 cbz x1, #+0x2c
6 rbit x3, x1
7 clz x3, x3
8 sub w3, w2, w3
9 lsr w4, w3, #3
10 and w3, w3, #0x7
11 sub x5, x1, #0x1 (1)
12 madd w0, w3, w4, w0
13 and x1, x1, x5
14 ldr x21, [x21]
15 b #-0x30
It is important to note that this kind of low-level optimized code also helps the compilers apply their own optimizations as the side-effect free code is easier to reason about.
The mighty R8 is impressive Link to heading
We already briefly mentioned R8 above, but let’s look at a larger example. To truly understand
the scope of what R8 does, let’s see how our Rect
works:
1inline fun Int.uncheckedCoerceIn(minimumValue: Int, maximumValue: Int) =
2 this.coerceAtLeast(minimumValue).coerceAtMost(maximumValue)
3
4inline fun Rect(l: Int, t: Int, r: Int, b: Int) =
5 Rect(
6 ((l.uncheckedCoerceIn(0, 8) and 0xf) shl 24) or
7 ((t.uncheckedCoerceIn(0, 8) and 0xf) shl 16) or
8 ((r.uncheckedCoerceIn(0, 8) and 0xf) shl 8) or
9 ((b.uncheckedCoerceIn(0, 8) and 0xf) )
10 )
11
12@JvmInline
13value class Rect @PublishedApi internal constructor(@PublishedApi internal val points: Int) {
14 inline val l: Int get() = points ushr 24
15 inline val t: Int get() = (points shr 16) and 0xf
16 inline val r: Int get() = (points shr 8) and 0xf
17 inline val b: Int get() = points and 0xf
18
19 override fun toString() = "Rect($l, $t, $r, $b)"
20}
Because our grid is an 8x8 grid, we can represent a rectangle using a single integer (we could even
use a Short
). Encoding and decoding the rectangle coordinates is straightforward, we only throw
in a range coercion to guarantee the coordinates are valid3.
Given this definition, you can understand that quite a bit is happening in this code:
We call rasterize()
twice, we pack 4 coordinates into a rectangle, and unpack those 4 coordinates.
Without R8, the generated aarch64 code is:
1sub x16, sp, #0x2000 (8192)
2ldr wzr, [x16]
3str x0, [sp, #-80]!
4stp x22, x23, [sp, #32]
5stp x24, x25, [sp, #48]
6stp x26, lr, [sp, #64]
7ldr x21, [x21]
8mov w25, #0x8
9mov x24, #0xff
10mov x23, #0x101010101010101
11mov w22, #0xf
12adrp x0, #+0x3000 (addr 0x5000)
13adr lr, #+0xc (addr 0x2e9c)
14ldr w0, [x0, #28]
15cbnz x20, #+0x5c8
16cbz w0, #+0x154
17ldrb w16, [x0, #115]
18cmp w16, #0xf0 (240)
19b.lo #+0x148
20mov w2, #0x0
21mov w1, #0x3
22mov w0, #0x3d
23ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
24blr lr
25mov x2, x25
26mov x1, x0
27mov w0, #0x3e
28ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
29blr lr
30and w0, w0, #0xf
31lsl w26, w0, #24
32mov w2, #0x0
33mov w1, #0x3
34mov w0, #0x3d
35ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
36blr lr
37mov x2, x25
38mov x1, x0
39mov w0, #0x3e
40ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
41blr lr
42and w0, w0, #0xf
43orr w26, w26, w0, lsl #16
44mov w2, #0x0
45mov w1, #0x7
46mov w0, #0x3d
47ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
48blr lr
49mov x2, x25
50mov x1, x0
51mov w0, #0x3e
52ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
53blr lr
54and w0, w0, #0xf
55orr w26, w26, w0, lsl #8
56mov w2, #0x0
57mov w1, #0x7
58mov w0, #0x3d
59ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
60blr lr
61mov x2, x25
62mov x1, x0
63mov w0, #0x3e
64ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
65blr lr
66and w0, w0, #0xf
67orr w0, w0, w26
68adrp x1, #+0x3000
69adr lr, #+0xc
70ldr w1, [x1, #32]
71cbnz x20, #+0x448
72cbz w1, #+0x84
73ldrb w16, [x1, #115]
74cmp w16, #0xf0 (240)
75b.lo #+0x78
76and w1, w22, w0, asr #16
77and w2, w22, w0, asr #8
78and w3, w0, #0xf
79sub w0, w2, w0, lsr #24
80sub w1, w3, w1
81sub w0, w25, w0
82lsr x0, x24, x0
83sub w2, w25, w2
84lsl x0, x0, x2
85sub w1, w25, w1
86lsl w1, w1, #3
87lsr x1, x23, x1
88sub w2, w25, w3
89lsl w2, w2, #3
90lsl x1, x1, x2
91mul x0, x0, x1
92mov x16, #0x78000000
93movk x16, #0x7878, lsl #32
94movk x16, #0x78, lsl #48
95orr x0, x0, x16
96ldp x22, x23, [sp, #32]
97ldp x24, x25, [sp, #48]
98ldp x26, lr, [sp, #64]
99add sp, sp, #0x50 (80)
100ret
101mov w0, #0x4
102bl #+0x56c ; pResolveType
103bl #+0x578 ; pInitializeStaticStorage
104b #-0x150 (addr 0x2eac)
105str x0, [sp, #16]
106mov w0, #0x6
107bl #+0x558 ; pResolveType
108bl #+0x564 ; pInitializeStaticStorage
109mov w1, w0
110ldr x0, [sp, #16]
111b #-0x8c
And what if we enable R8? We get a single constant load:
1mov x0, #0x1e00
2movk x0, #0x7e1e, lsl #16
3movk x0, #0x787e, lsl #32
4movk x0, #0x78, lsl #48
5ret
Turn on R8.
The complete version of the code presented here is available as a GitHub gist.
Except the bit twiddling code itself. it remains one of my major complaints about Kotlin. ↩︎
We’ll get back to this with a more complex example. You want to enable R8. ↩︎
We use our own version of
coerceIn()
. The standard one has unwanted side effects which breaks a bunch of optimizations. I’ll maybe write a post about it in the future. ↩︎