BlurHash is a compact representation of placeholders for images. A blur hash is encoded as a short string that can be rendered to a bitmap at runtime to display a “blurry” version of the source image. The way it works reminds me of how spherical harmonics are used in 3D rendering engines to efficiently encode irradiance.
I recently remembered that I had been meaning to look at the Kotlin implementation of BlurHash to see if there was a way to make it faster. I looked up the KMP port and after a few changes, I was able to make decoding a blur hash up to 4.35x faster on a Pixel 6 running Android 14. For all the measurements mentioned in this article, I decoded a fixed blur hash to a 384x384 image. The original code took 24.4 ms to do this when running at full speed, and 41.5 ms when locking the CPU clocks. This is way too long considering the low resolution of the image.
We will go step by step, and the full final implementation will be provided at the end of this article.
Squaring numbers
The original code uses a function called `signPow` to square values while retaining the original sign. This function is generic and can be used with any power, but during decoding it is only ever used to square values. Since `pow()` is expensive compared to a multiplication, my first step was to introduce a new, more specialized function and use it instead:

```kotlin
inline fun signedSquare(value: Float) = (value * value).withSign(value)
```
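For contrast, a generic sign-preserving power function along the lines of the original presumably looks something like this (a sketch; the actual `signPow` implementation may differ):

```kotlin
import kotlin.math.abs
import kotlin.math.pow
import kotlin.math.withSign

// Generic version: works with any exponent, but pays for a pow() call
fun signPow(value: Float, exponent: Float): Float =
    abs(value).pow(exponent).withSign(value)
```

`signedSquare(-2f)` and `signPow(-2f, 2f)` both produce `-4f`, but the specialized version replaces the `pow()` call with a single multiplication.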
Character lookups
Before generating the final bitmap, the blur hash algorithm initializes a set of basis colors by decoding them from the hash string, using a “base 83” encoding. The next step was to optimize the following function:
```kotlin
fun decode83(value: String, from: Int = 0, to: Int = value.length): Int {
    var result = 0
    val chars = value.toCharArray()
    for (i in from until to) {
        result = result * CHARS.size + CHARS.indexOf(chars[i])
    }
    return result
}
```
There are two major issues with this function:
- The input string is converted to a `CharArray` on every call.
- Each character in the array is used to perform a linear search through an array of characters called `CHARS`.
The input string isn’t large but this is still unnecessary work that can easily be optimized away in two steps:
- Don’t convert the input string to an array; simply use the `get()` operator to access each character.
- Create a new lookup table that maps a character’s code to its index in the `CHARS` array.
Here is the new function:
```kotlin
fun decode83(value: String, from: Int = 0, to: Int = value.length): Int {
    var result = 0
    val lut = IndexOfChars
    for (i in from until to) {
        result = result * 83 + lut[value[i].code]
    }
    return result
}
```
The `IndexOfChars` array is a constant `ByteArray` generated once offline. This optimization, combined with the previous one, improves performance by roughly 6%.
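For reference, such a table could be generated like this (a sketch; `CHARS` is the base-83 alphabet from the BlurHash specification, and the table name matches the one used above):

```kotlin
// Base-83 alphabet defined by the BlurHash specification
const val CHARS = "0123456789" +
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
    "abcdefghijklmnopqrstuvwxyz" +
    "#\$%*+,-.:;=?@[]^_{|}~"

// Maps a character code directly to its index in CHARS;
// 128 entries cover the whole ASCII range used by the alphabet
val IndexOfChars = ByteArray(128).apply {
    for (i in CHARS.indices) this[CHARS[i].code] = i.toByte()
}
```

With this table, `lut[value[i].code]` is a single array read instead of an O(n) scan of `CHARS`.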
Lookup table, again
Another initialization step requires applying the sRGB transfer function to each channel of a decoded color. Since we are working with colors encoded with 8 bits per channel, we only have 256 possible values to convert. We can speed this up by using a lookup table:
```kotlin
val SrgbToLinear = FloatArray(256) {
    srgbToLinear(it)
}

internal fun decodeDc(colorEnc: Int, colors: FloatArray, offset: Int) {
    val lut = SrgbToLinear
    colors[offset + 0] = lut[(colorEnc shr 16) and 255]
    colors[offset + 1] = lut[(colorEnc shr 8) and 255]
    colors[offset + 2] = lut[colorEnc and 255]
}
```
While we’re at it, we can also avoid returning a new float array every time, and instead write the values directly into the destination.
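For reference, the conversion captured by the table is the standard sRGB transfer function. A sketch of the `srgbToLinear()` helper it calls (the exact signature is an assumption, taking an 8-bit channel value):

```kotlin
import kotlin.math.pow

// Standard sRGB to linear conversion for an 8-bit channel value
fun srgbToLinear(channel: Int): Float {
    val v = channel / 255.0f
    return if (v <= 0.04045f) v / 12.92f else ((v + 0.055f) / 1.055f).pow(2.4f)
}
```

The branch and the `pow()` call are exactly what the 256-entry table lets us pay for only once per channel value instead of on every decode.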
Caches and auto-boxing
The loop used to generate the bitmap needs to compute a number of cosines. To avoid repeating heavy computations, the original implementation generates cosine tables and caches them in a hashmap. Unfortunately, these maps are regular `HashMap` instances using `Int` as the key, and since the key is a “large” value (`width * componentX`, where `width` is the width of the generated image in pixels), every lookup causes boxing. It’s not a big deal since we are already allocating large arrays for the bitmap, but we might as well fix it.
To address this, I switched to an `IntObjectMap` from the `androidx.collection` library. It removes all boxing and provides faster lookups, for a gain of about 2.5%.
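A minimal sketch of what the cache can look like with `androidx.collection` (the field and helper names are assumptions):

```kotlin
import androidx.collection.MutableIntObjectMap

// Primitive-keyed map: no Integer boxing on get/put
val cosineCache = MutableIntObjectMap<FloatArray>()

fun cosineTableFor(width: Int, componentX: Int): FloatArray =
    cosineCache.getOrPut(width * componentX) {
        FloatArray(width * componentX) // filled with cosines elsewhere
    }
```

Because the keys are plain `Int` values, neither the lookup nor the insertion allocates a boxed `Integer`, unlike `HashMap<Int, FloatArray>`.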
Flattening arrays
The initialization step of the algorithm populates a structure called `BlurHashInfo`, which holds an `Array<FloatArray>`. The outer array contains `x * y` `FloatArray` instances, where `x` and `y` are the numbers of components of the color basis. Each `FloatArray` is an array of 3 floats, originally created by the `decodeDc()` function we saw above.

The next step in our optimization adventure is to flatten these arrays into a single `FloatArray` with a size of `x * y * 3`. This provides better cache locality and gives a speedup of ~7%.
Factor out computations
Let’s look at the main loop that generates pixels now. The original code has the following shape:
```kotlin
for (y in 0..<height) {
    for (x in 0..<width) {
        for (j in 0..<componentY) {
            for (i in 0..<componentX) {
                // per-pixel work happens here
            }
        }
    }
}
```
The bulk of the work happens in the innermost loop. In particular, the code computes or looks up cosines. The “Y” cosine only depends on `height` and `componentY`, but is re-computed for every value of `i`. An easy optimization is to move the “Y” cosine lookup one loop up, which wins another 13%. Now that we’re in the main loop, small changes lead to big results.
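A sketch of the hoist (the `getCos()` signature shown here is an assumption): since the “Y” cosine doesn’t vary with `i`, it moves from the innermost loop to the `j` loop:

```kotlin
for (y in 0..<height) {
    for (x in 0..<width) {
        for (j in 0..<componentY) {
            // Hoisted: computed once per j instead of once per (j, i) pair
            val cosY = getCos(cacheCosineY, componentY, j, y, height)
            for (i in 0..<componentX) {
                val basis = getCos(cacheCosineX, componentX, i, x, width) * cosY
                // accumulate basis * color contributions here
            }
        }
    }
}
```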
Getting rid of branches
Deep inside those nested loops, the implementation calls `getCos()` to either compute a cosine or read it from the caches we mentioned earlier. This means, however, that every single iteration tests whether we should compute the value or fetch it from the cache. This is wasteful because we only need to compute the value the very first time, and yet we run the same test for every pixel.
In this case it’s better to do the work upfront and to simply populate the caches before we iterate over the pixels. This removes a large number of branches, yielding another 20% performance improvement.
Faster pre-caching
We’ve now moved the computation of the cosines outside of the loops, but we still need to compute quite a few of them (`width * componentX` and `height * componentY`):
```kotlin
for (x in 0..<width) {
    for (i in 0..<componentX) {
        val index = i + componentX * x
        cacheCosineX[index] = cos(PI * x * i / width).toFloat()
    }
}
```
We can factor out the multiplication by `PI` and the division by `width`:
This doesn’t look like much, but it scores us another ~1.5%.
Yet another lookup table
Let’s go back to the main loop. Once we’ve computed a pixel’s color, we need to re-apply the sRGB transfer function to get a displayable sRGB value. The transfer function executes a branch and a call to `pow()` for every pixel. Just like we did during the initialization of the algorithm, we can replace this function with a lookup table. Our inputs are floats, but since we output 8-bit colors, we only need 256 entries in our lookup table:
```kotlin
val LinearToSrgb = IntArray(256) {
    linearToSrgb(it / 255f)
}

val red = LinearToSrgb[(r * 255.0f + 0.5f).toInt()]
val green = LinearToSrgb[(g * 255.0f + 0.5f).toInt()]
val blue = LinearToSrgb[(b * 255.0f + 0.5f).toInt()]
```
We just hit our most important optimization: the algorithm now runs 2.4x faster compared to the previous step.
Merging loops
Remember these nested loops? We can go faster by merging the two inner loops. To achieve this, I changed the cosine tables to cache `width * componentX * componentY` values (or `height` for the other table), instead of `width * componentX`. This costs us a bit more memory and initialization time (`componentX` and `componentY` were 5 and 6 in my tests), but it allows us to rewrite the pixel generation loop this way:
```kotlin
for (y in 0..<height) {
    for (x in 0..<width) {
        var r = 0f
        var g = 0f
        var b = 0f

        for (i in 0..<componentCount) {
            val cosX = cosinesX[x * componentCount + i]
            val cosY = cosinesY[y * componentCount + i]
            val basis = cosX * cosY
            val index = i * 3
            r += colors[index + 0] * basis
            g += colors[index + 1] * basis
            b += colors[index + 2] * basis
        }

        val red = LinearToSrgb[(r * 255.0f + 0.5f).toInt()]
        val green = LinearToSrgb[(g * 255.0f + 0.5f).toInt()]
        val blue = LinearToSrgb[(b * 255.0f + 0.5f).toInt()]
        // setPixel(red, green, blue)
    }
}
```
Fusing the loops nets another 10% improvement.
Graphics == approximation
When optimizing graphics code, it’s always helpful to ask yourself whether the output needs to be perfectly correct, or just good enough. Since we are generating (extremely) blurry proxies of the source images, we do not need to be perfectly correct. This means we don’t have to use the exact `cos()` function; an approximation will do:
```kotlin
inline fun normalizedAngleSin(normalizedAngle: Float): Float {
    val angle = normalizedAngle - floor(normalizedAngle + 0.5f)
    val x = abs(2.0f * angle)
    val a = 1.0f - x
    return 8.0f * angle * a / (1.25f - x * a)
}

inline fun normalizedAngleCos(normalizedAngle: Float): Float =
    normalizedAngleSin(normalizedAngle + 0.25f)
```
The maximum error of this function is ~0.0935 degrees compared to `cos()`, but who cares? It’s a blurry picture anyway. This function expects a value between 0 and 1, which means we need to divide our angle by `2.0 * PI`. We already factored out part of the call to `cos()` earlier:
```kotlin
val divisor = (PI / width).toFloat()
```

After dividing by `2.0 * PI`, we are left with:

```kotlin
val divisor = 1f / (2f * width)
```
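Putting the two together, the pre-caching loop can use the approximation directly (a sketch based on the snippets above; the loop shape is an assumption):

```kotlin
// The angle is pre-divided by 2 * PI, as normalizedAngleCos() expects
val divisor = 1f / (2f * width)
for (x in 0..<width) {
    val angleX = x * divisor
    for (i in 0..<componentX) {
        cacheCosineX[i + componentX * x] = normalizedAngleCos(angleX * i)
    }
}
```

This removes every call to the real `cos()` from the hot path, replacing it with a handful of multiplications and one division.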
Hey, someone did this already! (almost)
While I was teasing this work on Bluesky, Christophe Beyls asked me whether I had compared it to his own optimizations to the original code that the KMP port seems to be based on. I had not, but looking at his changelog, we both implemented many of the same optimizations!
The main differences between the two are:

- Use a lookup table from sRGB to linear.
- Use a lookup table from linear to sRGB.
- Use an approximation of `cos()`.
- Eliminate the linear search through the `CHARS` array/string.
- Merge the two inner loops in the bitmap generation code.
- Small optimizations in `decodeAc()`.
I ported all my extra optimizations to Christophe’s implementation, and the new version is 2.8x faster on a Pixel 6. I posted the full source code as a gist; feel free to check it out to see what the final code looks like.
What next?
Could we do better? Absolutely. The pixel generation loop screams “SIMD” and could benefit from proper vectorization. I measured my final version at ~5 ms on a Pixel 6, which still feels absurdly slow considering the size of the generated image.
I didn’t spend time looking at the generated aarch64 code on Android to see what else could be done, but it wouldn’t surprise me if there were still many opportunities for improvement.