BlurHash is a compact representation of placeholders for images. A blur hash is encoded as a short string that can be rendered to a bitmap at runtime to display a “blurry” version of the source image. The way it works reminds me of how spherical harmonics are used in 3D rendering engines to efficiently encode irradiance.
I recently remembered that I had been meaning to look at the Kotlin implementation of BlurHash to see if there was a way to make it faster. I looked up the KMP port and after a few changes, I was able to make decoding a blur hash up to 4.35x faster on a Pixel 6 running Android 14. For all the measurements mentioned in this article, I decoded a fixed blur hash to a 384x384 image. The original code took 24.4 ms to do this when running at full speed, and 41.5 ms when locking the CPU clocks. This is way too long considering the low resolution of the image.
We will go step by step, and the full final implementation will be provided at the end of this article.
Squaring numbers
The original code uses a function called `signPow` to square values while retaining the original sign. This function is generic and can be used with any power, but during decoding it is only ever used to square values. Since `pow()` is expensive compared to a multiplication, my first step was to introduce a new, more specialized function and use it instead:

```kotlin
inline fun signedSquare(value: Float) = (value * value).withSign(value)
```
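For contrast, a generic sign-preserving power function along the lines of the original presumably looks something like this (a sketch; the actual `signPow` implementation may differ):

```kotlin
import kotlin.math.abs
import kotlin.math.pow
import kotlin.math.withSign

// Generic version: works with any exponent, but pays for a pow() call
fun signPow(value: Float, exponent: Float): Float =
    abs(value).pow(exponent).withSign(value)
```

`signedSquare(-2f)` and `signPow(-2f, 2f)` both produce `-4f`, but the specialized version replaces the `pow()` call with a single multiplication.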
Character lookups
Before generating the final bitmap, the blur hash algorithm initializes a set of basis colors by decoding them from the hash string, using a “base 83” encoding. The next step was to optimize the following function:
```kotlin
fun decode83(value: String, from: Int = 0, to: Int = value.length): Int {
    var result = 0
    val chars = value.toCharArray()
    for (i in from until to) {
        result = result * CHARS.size + CHARS.indexOf(chars[i])
    }
    return result
}
```
There are two major issues with this function:
- The input string is converted to a `CharArray` on every call.
- Each character in the array is used to perform a linear search through an array of characters called `CHARS`.
The input string isn’t large but this is still unnecessary work that can easily be optimized away in two steps:
- Don’t convert the input string to an array; simply use the `get()` operator to access each character.
- Create a new lookup table that maps a character’s code to its index in the `CHARS` array.
Here is the new function:
```kotlin
fun decode83(value: String, from: Int = 0, to: Int = value.length): Int {
    var result = 0
    val lut = IndexOfChars
    for (i in from until to) {
        result = result * 83 + lut[value[i].code]
    }
    return result
}
```
The `IndexOfChars` array is a constant `ByteArray` generated once offline. This optimization, combined with the previous one, improves performance by roughly 6%.
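For reference, such a table could be generated like this (a sketch; `CHARS` is the base-83 alphabet from the BlurHash specification, and the table name matches the one used above):

```kotlin
// Base-83 alphabet defined by the BlurHash specification
const val CHARS = "0123456789" +
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
    "abcdefghijklmnopqrstuvwxyz" +
    "#\$%*+,-.:;=?@[]^_{|}~"

// Maps a character code directly to its index in CHARS;
// 128 entries cover the whole ASCII range used by the alphabet
val IndexOfChars = ByteArray(128).apply {
    for (i in CHARS.indices) this[CHARS[i].code] = i.toByte()
}
```

With this table, `lut[value[i].code]` is a single array read instead of an O(n) scan of `CHARS`.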
Lookup table, again
Another initialization step requires applying the sRGB transfer function to each channel of a decoded color. Since we are working with colors encoded with 8 bits per channel, we only have 256 possible values to convert. We can speed this up by using a lookup table:
```kotlin
val SrgbToLinear = FloatArray(256) {
    srgbToLinear(it)
}

internal fun decodeDc(colorEnc: Int, colors: FloatArray, offset: Int) {
    val lut = SrgbToLinear
    colors[offset + 0] = lut[(colorEnc shr 16) and 255]
    colors[offset + 1] = lut[(colorEnc shr 8) and 255]
    colors[offset + 2] = lut[colorEnc and 255]
}
```
While we’re at it, we can also avoid returning a new float array every time, and instead write the values directly into the destination.
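For reference, the conversion captured by the table is the standard sRGB transfer function. A sketch of the `srgbToLinear()` helper it calls (the exact signature is an assumption, taking an 8-bit channel value):

```kotlin
import kotlin.math.pow

// Standard sRGB to linear conversion for an 8-bit channel value
fun srgbToLinear(channel: Int): Float {
    val v = channel / 255.0f
    return if (v <= 0.04045f) v / 12.92f else ((v + 0.055f) / 1.055f).pow(2.4f)
}
```

The branch and the `pow()` call are exactly what the 256-entry table lets us pay for only once per channel value instead of on every decode.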
Caches and auto-boxing
The loop used to generate the bitmap needs to compute a number of cosines. To avoid repeating heavy computations, the original implementation generates cosine tables and caches them in a hashmap. Unfortunately, these maps are regular `HashMap` instances using `Int` as the key, and since the key is a “large” value (`width * componentX`, where `width` is the width of the generated image in pixels), every lookup causes boxing. It’s not a big deal since we are already allocating large arrays for the bitmap, but we might as well fix it.
To address this, I switched to an `IntObjectMap` from the `androidx.collection` library. It removes all boxing and provides faster lookups, for a gain of about 2.5%.
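A minimal sketch of what the cache can look like with `androidx.collection` (the field and helper names are assumptions):

```kotlin
import androidx.collection.MutableIntObjectMap

// Primitive-keyed map: no Integer boxing on get/put
val cosineCache = MutableIntObjectMap<FloatArray>()

fun cosineTableFor(width: Int, componentX: Int): FloatArray =
    cosineCache.getOrPut(width * componentX) {
        FloatArray(width * componentX) // filled with cosines elsewhere
    }
```

Because the keys are plain `Int` values, neither the lookup nor the insertion allocates a boxed `Integer`, unlike `HashMap<Int, FloatArray>`.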
Flattening arrays
The initialization step of the algorithm populates a structure called `BlurHashInfo`, which holds an `Array<FloatArray>`. The outer array contains `x * y` `FloatArray` instances, where `x` and `y` are the numbers of components of the color basis. Each `FloatArray` is an array of 3 floats, originally created by the `decodeDc()` function we saw above.

The next step in our optimization adventure is to flatten these arrays into a single `FloatArray` with a size of `x * y * 3`. This provides better cache locality and gives a speedup of ~7%.
Factor out computations
Let’s look at the main loop that generates pixels now. The original code has the following shape:
```kotlin
for (y in 0..<height) {
    for (x in 0..<width) {
        for (j in 0..<componentY) {
            for (i in 0..<componentX) {
                // per-pixel work happens here
            }
        }
    }
}
```
The bulk of the work happens in the innermost loop. In particular, the code computes or looks up cosines. The “Y” cosine only depends on `height` and `componentY`, but is re-computed for every value of `i`. An easy optimization is to move the “Y” cosine lookup one loop up, which wins another 13%. Now that we’re in the main loop, small changes lead to big results.
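A sketch of the hoist (the `getCos()` signature shown here is an assumption): since the “Y” cosine doesn’t vary with `i`, it moves from the innermost loop to the `j` loop:

```kotlin
for (y in 0..<height) {
    for (x in 0..<width) {
        for (j in 0..<componentY) {
            // Hoisted: computed once per j instead of once per (j, i) pair
            val cosY = getCos(cacheCosineY, componentY, j, y, height)
            for (i in 0..<componentX) {
                val basis = getCos(cacheCosineX, componentX, i, x, width) * cosY
                // accumulate basis * color contributions here
            }
        }
    }
}
```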
Getting rid of branches
Deep inside those nested loops, the implementation calls `getCos()` to either compute a cosine or read it from the caches we mentioned earlier. This means, however, that every single iteration tests whether we should compute the value or fetch it from the cache. This is wasteful because we only need to compute the value the very first time, and yet we run the same test for every pixel.
In this case it’s better to do the work upfront and to simply populate the caches before we iterate over the pixels. This removes a large number of branches, yielding another 20% performance improvement.
Faster pre-caching
We’ve now moved the computation of the cosines outside of the loops, but we still need to compute quite a few of them (`width * componentX` and `height * componentY`):
```kotlin
for (x in 0..<width) {
    for (i in 0..<componentX) {
        val index = i + componentX * x
        cacheCosineX[index] = cos(PI * x * i / width).toFloat()
    }
}
```
We can factor out the multiplication by `PI` and the division by `width`:
This doesn’t look like much, but it scores us another ~1.5%.
Yet another lookup table
Let’s go back to the main loop. Once we’ve computed a pixel’s color, we need to re-apply the sRGB transfer function to get a displayable sRGB value. The transfer function executes a branch and a call to `pow()` for every pixel. Just like we did during the initialization of the algorithm, we can replace this function with a lookup table. Our inputs are floats, but since we output 8-bit colors, we only need 256 entries in our lookup table:
```kotlin
val LinearToSrgb = IntArray(256) {
    linearToSrgb(it / 255f)
}

val red = LinearToSrgb[(r * 255.0f + 0.5f).toInt()]
val green = LinearToSrgb[(g * 255.0f + 0.5f).toInt()]
val blue = LinearToSrgb[(b * 255.0f + 0.5f).toInt()]
```
We just hit our most important optimization: the algorithm now runs 2.4x faster compared to the previous step.
Merging loops
Remember these nested loops? We can go faster by merging the two inner loops. To achieve this, I changed the cosine tables to cache `width * componentX * componentY` values (or `height` for the other table), instead of `width * componentX`. This costs us a bit more memory and initialization time (`componentX` and `componentY` were 5 and 6 in my tests), but it allows us to rewrite the pixel generation loop this way:
```kotlin
for (y in 0..<height) {
    for (x in 0..<width) {
        var r = 0f
        var g = 0f
        var b = 0f

        for (i in 0..<componentCount) {
            val cosX = cosinesX[x * componentCount + i]
            val cosY = cosinesY[y * componentCount + i]
            val basis = cosX * cosY
            val index = i * 3
            r += colors[index + 0] * basis
            g += colors[index + 1] * basis
            b += colors[index + 2] * basis
        }

        val red = LinearToSrgb[(r * 255.0f + 0.5f).toInt()]
        val green = LinearToSrgb[(g * 255.0f + 0.5f).toInt()]
        val blue = LinearToSrgb[(b * 255.0f + 0.5f).toInt()]
        // setPixel(red, green, blue)
    }
}
```
Fusing the loops nets another 10% improvement.
Graphics == approximation
When optimizing graphics code, it’s always helpful to ask yourself whether the output needs to be perfectly correct, or just good enough. Since we are generating (extremely) blurry proxies of the source images, we do not need to be perfectly correct. This means we don’t have to use the exact `cos()` function; an approximation will do:
```kotlin
inline fun normalizedAngleSin(normalizedAngle: Float): Float {
    val angle = normalizedAngle - floor(normalizedAngle + 0.5f)
    val x = abs(2.0f * angle)
    val a = 1.0f - x
    return 8.0f * angle * a / (1.25f - x * a)
}

inline fun normalizedAngleCos(normalizedAngle: Float): Float =
    normalizedAngleSin(normalizedAngle + 0.25f)
```
The maximum error of this function is ~0.0935 degrees compared to `cos()`, but who cares? It’s a blurry picture anyway. This function expects a value between 0 and 1, which means we need to divide our angle by `2.0 * PI`. We already factored out part of the call to `cos()` earlier:
```kotlin
val divisor = (PI / width).toFloat()
```

After dividing by `2.0 * PI`, we are left with:

```kotlin
val divisor = 1f / (2f * width)
```
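Putting the two together, the pre-caching loop can use the approximation directly (a sketch based on the snippets above; the loop shape is an assumption):

```kotlin
// The angle is pre-divided by 2 * PI, as normalizedAngleCos() expects
val divisor = 1f / (2f * width)
for (x in 0..<width) {
    val angleX = x * divisor
    for (i in 0..<componentX) {
        cacheCosineX[i + componentX * x] = normalizedAngleCos(angleX * i)
    }
}
```

This removes every call to the real `cos()` from the hot path, replacing it with a handful of multiplications and one division.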
Hey, someone did this already! (almost)
While I was teasing this work on Bluesky, Christophe Beyls asked me whether I had compared it to his own optimizations to the original code that the KMP port seems to be based on. I had not, but looking at his changelog, we both implemented many of the same optimizations!
The main differences between the two are:

- Use a lookup table from sRGB to linear.
- Use a lookup table from linear to sRGB.
- Use an approximation of `cos()`.
- Eliminate the linear search through the `CHARS` array/string.
- Merge the two inner loops in the bitmap generation code.
- Small optimizations in `decodeAc()`.
I ported all my extra optimizations to Christophe’s implementation, and the new version is 2.8x faster on a Pixel 6. I posted the full source code as a gist; feel free to check it out to see what the final code looks like.
What next?
Could we do better? Absolutely. The pixel generation loop screams “SIMD” and could benefit from proper vectorization. I measured my final version at ~5 ms on a Pixel 6, which still feels absurdly slow considering the size of the generated image.
I didn’t spend time looking at the generated aarch64 code on Android to see what else could be done, but it wouldn’t surprise me if there were still many opportunities for improvement.