I had an interesting time investigating a bug related to SQLite performance today, so I thought I'd share a thread.

A PicoShare user reported that it took 14.57 seconds to load their list of files.

The first thing I noticed was that they had uploaded a 1.24 GB file. The files I use with PicoShare tend to be images and documents in the 5-50 MB range, so 1.24 GB is larger than I've tested with.

One of the wacky architectural choices I made with PicoShare was to store all file data in SQLite as BLOB data. This is an unusual choice, as most apps store file uploads on the filesystem. But putting it in SQLite means I get cloud replication "for free" with @litestreamio.

Storing file data in SQLite leads to odd behavior like this, so I suspected the bottleneck was somewhere in my SQLite logic.

Here's the code to retrieve file metadata. I keep one table for file metadata and another table for file data, broken into 32 KB chunks.
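The real query lives in the PicoShare repo; here's a minimal sketch of its shape, with hypothetical table and column names (entries, entries_data, chunk) that may not match the actual schema:

```sql
-- One row of metadata per file, plus the file's contents split into
-- 32 KB chunks in a separate table.
CREATE TABLE entries (
  id TEXT PRIMARY KEY,
  filename TEXT,
  upload_time TEXT
);

CREATE TABLE entries_data (
  id TEXT REFERENCES entries (id),
  chunk_index INTEGER,
  chunk BLOB  -- up to 32 KB of file contents
);

-- Listing files: derive each file's size by summing its chunk lengths.
SELECT
  entries.id,
  entries.filename,
  entries.upload_time,
  SUM(LENGTH(entries_data.chunk)) AS file_size
FROM entries
JOIN entries_data ON entries.id = entries_data.id
GROUP BY entries.id;
```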

I suspected that the problem was on the SUM(LENGTH(chunk)) line because it's the only one that's touching the file data.

I was able to reproduce the issue locally by uploading a 1.1 GB file. The page load time jumped to 9.3 seconds.

I tried removing the LENGTH() function and just hardcoding the size calculation to 1.
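In terms of the sketch above, that experiment amounts to swapping the LENGTH() call for a constant:

```sql
-- Same listing query, but with each chunk's length hardcoded to 1,
-- so SQLite never has to touch the BLOB contents.
SELECT
  entries.id,
  entries.filename,
  entries.upload_time,
  SUM(1) AS file_size  -- placeholder: treats every chunk as 1 byte
FROM entries
JOIN entries_data ON entries.id = entries_data.id
GROUP BY entries.id;
```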

It worked! Page load time dropped to 8ms.

I was confident that the SUM(LENGTH(chunk)) line was causing the latency.

Does calling LENGTH cause SQLite to read the full BLOB data into memory? That would certainly explain why it took 9 seconds to calculate the length of a 1.1 GB file.

I checked the SQLite docs. They don't explicitly say that LENGTH() reads the full BLOB data, but they do say that for strings, it calculates the length on the fly by scanning for the first null byte. I'm assuming that for BLOB types, SQLite similarly iterates through the full contents of the column.

I had specifically avoided storing the file size in my SQLite DB because it was redundant. The raw data is there, so we can always derive the size. It shouldn't be something we store independently.

But based on the 9s latency, calculating sizes on the fly wasn't going to work.

My first thought was to store the chunk size alongside the blob in the table containing file data. That had the advantage of keeping size close to the data it described.
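Sketched against the hypothetical schema above, that's a chunk_size column in the chunk table:

```sql
-- Record each chunk's size next to the chunk itself.
ALTER TABLE entries_data ADD COLUMN chunk_size INTEGER;

-- The listing query then sums the precomputed sizes instead of calling
-- LENGTH() on the blobs.
SELECT
  entries.id,
  SUM(entries_data.chunk_size) AS file_size
FROM entries
JOIN entries_data ON entries.id = entries_data.id
GROUP BY entries.id;
```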

Storing the chunk size worked, and it brought the latency down from 9s to 839ms, a 10x performance boost.

But 839ms to calculate the size of a single file was still pretty slow...

This surprised me, and I still don't have a good explanation for it. It's 3,708 rows, so it doesn't seem like it should take SQLite *that* long to calculate the SUM of 3,708 values.

I'm guessing the large blob in each row slows down the query even though we don't read it.

Next, I tried storing the file size along with the file metadata.
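Again as a sketch against the hypothetical schema, that's a single size column on the metadata table, written once at upload time:

```sql
-- Store the full file size once, alongside the rest of the metadata.
ALTER TABLE entries ADD COLUMN file_size INTEGER;

-- Listing files no longer touches the chunk table at all.
SELECT id, filename, upload_time, file_size
FROM entries;
```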

And we have a winner! For the same 1.1 GB file, latency dropped from 9s to 9ms, a 1,000x speedup.

I still didn't love the idea of storing the redundant file size, so I considered creating a virtual SQLite table instead.

But we'd still have to populate the virtual table at app load time, which would take ~10s per GB of data.
sqlite.org/vtab.html

The last tricky part was writing a SQL migration to populate the sizes of files that were uploaded and stored in the DB before this change. I've never written an update that derives from other data in the DB before, but it wasn't too hard.
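Roughly, against the hypothetical schema above, the backfill is an UPDATE with a correlated subquery:

```sql
-- Backfill file_size for files uploaded before the column existed,
-- deriving each value from the chunk data already in the DB.
UPDATE entries
SET file_size = (
  SELECT SUM(LENGTH(chunk))
  FROM entries_data
  WHERE entries_data.id = entries.id
)
WHERE file_size IS NULL;
```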

Update! @wmertens and @dholth showed me a simpler way to get the same performance without storing the file size redundantly: creating an index.

Creating the index is a one-line fix, and it doesn't have the drawback my original PR did of storing a redundant copy of the file size. github.com/mtlynch/picoshare/p
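The exact index is in the linked PR; the idea, sketched against the hypothetical schema above, is an expression index covering the chunk lengths, so SQLite can answer the SUM(LENGTH(chunk)) query from the index alone, without reading the blob pages:

```sql
-- Hypothetical version of the fix: SQLite can compute each file's size
-- by scanning this index instead of the blobs themselves.
CREATE INDEX idx_entries_data_chunk_length
  ON entries_data (id, LENGTH(chunk));
```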
