I've been hacking on DataStation for a year now, so it's time for a retro and a look at what's next!
4k+ stars in 15 repos. 7 posts on the front of HN. Dozens of failed investor chats (including YC).
tl;dr: DataStation has a bright future & I'm on the job market. :)
The results are in for April:
My programmatic SEO experiment is killing it 6 months in! Over 2k visitors in April alone. That's +226% from last month!
All the nitty-gritty details are in my April retrospective: https://allisonseboldt.com/april-2022/
Update! @wmertens and @dholth showed me a simpler way to achieve the same performance without storing the file size redundantly. Creating an index achieves the same thing.
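For anyone curious what the index approach could look like, here's a rough sketch. The post doesn't say which index was created, so everything here is an assumption: the `entries_data` table, its columns, and the idea of indexing the `LENGTH(chunk)` expression are all hypothetical. Whether SQLite can answer the query from the index alone depends on the version and the query planner, so `EXPLAIN QUERY PLAN` is worth checking on a real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per chunk of an uploaded file.
conn.execute(
    "CREATE TABLE entries_data (id TEXT, chunk_index INTEGER, chunk BLOB)"
)
conn.executemany(
    "INSERT INTO entries_data VALUES (?, ?, ?)",
    [("a", 0, b"x" * 10), ("a", 1, b"y" * 5), ("b", 0, b"z" * 7)],
)

# Assumed index shape: the file ID plus the length of each chunk, so the
# lengths are stored in the index rather than derived from the blob rows.
conn.execute(
    "CREATE INDEX idx_entries_data_length "
    "ON entries_data (id, LENGTH(chunk))"
)

# Same size query as before; no extra size column is needed.
total = conn.execute(
    "SELECT SUM(LENGTH(chunk)) FROM entries_data WHERE id = ?", ("a",)
).fetchone()[0]

index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    )
]
```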
The last tricky part was writing a SQL migration to populate the sizes of files that were uploaded and stored in the DB before this change. I've never written an update that derives from other data in the DB before, but it wasn't too hard.
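A minimal sketch of that kind of backfill migration, using a correlated subquery so each row's new value is derived from other rows in the DB. The table and column names (`entries`, `entries_data`, `file_size`, `chunk`) are made up for illustration and aren't taken from the actual schema.

```python
import sqlite3


def migrate_file_sizes(conn):
    """Add a file_size column and backfill it from existing chunk data."""
    conn.execute("ALTER TABLE entries ADD COLUMN file_size INTEGER")
    # For each entry, sum the lengths of its chunks. COALESCE covers
    # entries that have no chunks, where SUM would return NULL.
    conn.execute(
        """
        UPDATE entries
        SET file_size = (
            SELECT COALESCE(SUM(LENGTH(chunk)), 0)
            FROM entries_data
            WHERE entries_data.id = entries.id
        )
        """
    )
    conn.commit()


# Tiny in-memory database standing in for files uploaded before the change.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id TEXT PRIMARY KEY, filename TEXT)")
conn.execute(
    "CREATE TABLE entries_data (id TEXT, chunk_index INTEGER, chunk BLOB)"
)
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [("a", "a.bin"), ("empty", "empty.bin")],
)
conn.executemany(
    "INSERT INTO entries_data VALUES (?, ?, ?)",
    [("a", 0, b"x" * 10), ("a", 1, b"y" * 5)],
)

migrate_file_sizes(conn)
sizes = dict(conn.execute("SELECT id, file_size FROM entries"))
```

The migration still pays the slow `SUM(LENGTH(chunk))` cost, but only once per existing file rather than on every page load.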
And we have a winner! For the same 1.1 GB file, latency dropped from 9s to 9ms, a 1,000x speedup.
Next, I tried storing the file size along with the file metadata.
This surprised me, and I still don't have a good explanation for it. It's 3,708 rows, so it doesn't seem like it should take SQLite *that* long to calculate the SUM of 3,708 values.
I'm guessing the large blob in each row slows down the query even though we don't read it.
I tried running the SQL on its own, and it took 743 ms.
Storing the chunk size worked, and it brought the latency down from 9s to 839ms, a 10x performance boost.
But 839ms to calculate the size of a single file was still pretty slow...
But based on the 9s latency, calculating sizes on the fly wasn't going to work.
My first thought was to store the chunk size alongside the blob in the table containing file data. That had the advantage of keeping size close to the data it described.
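Roughly, that idea could look like the sketch below: write the size next to each chunk at insert time, then sum the small integers instead of measuring blobs. The schema and the helper names (`insert_chunk`, `file_size`) are hypothetical, not the real code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: chunk_size sits in the same row as the blob it describes.
conn.execute(
    """
    CREATE TABLE entries_data (
        id TEXT,
        chunk_index INTEGER,
        chunk BLOB,
        chunk_size INTEGER
    )
    """
)


def insert_chunk(conn, entry_id, index, chunk):
    """Store a chunk and record its size in the same row."""
    conn.execute(
        "INSERT INTO entries_data (id, chunk_index, chunk, chunk_size) "
        "VALUES (?, ?, ?, ?)",
        (entry_id, index, chunk, len(chunk)),
    )


def file_size(conn, entry_id):
    """Total file size, summed from stored sizes rather than LENGTH(chunk)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(chunk_size), 0) FROM entries_data WHERE id = ?",
        (entry_id,),
    ).fetchone()
    return row[0]


insert_chunk(conn, "a", 0, b"x" * 10)
insert_chunk(conn, "a", 1, b"y" * 5)
```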
I checked the SQLite docs. They didn't explicitly say that LENGTH reads the full blob data, but they suggested that for strings, it calculates length on the fly by looking for the first null byte. I'm assuming that for BLOB types, SQLite iterates through the full contents of the column.
It worked! Page load time dropped to 8ms.
I was confident that the SUM(LENGTH(chunk)) line was causing the latency.
I tried removing the LENGTH() function and just hardcoding the size calculation to 1.
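The experiment can be sketched as a side-by-side timing of the real query and the hardcoded one. The `entries_data` table and the `timed` helper are hypothetical, and a small in-memory database like this won't reproduce the multi-second gap seen with a 1.1 GB file on disk; it only shows the shape of the comparison.

```python
import sqlite3
import time


def timed(conn, sql, params=()):
    """Run a single-value query and return (value, elapsed seconds)."""
    start = time.perf_counter()
    value = conn.execute(sql, params).fetchone()[0]
    return value, time.perf_counter() - start


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries_data (id TEXT, chunk_index INTEGER, chunk BLOB)"
)
# 100 chunks of 32 KiB each, standing in for an uploaded file.
conn.executemany(
    "INSERT INTO entries_data VALUES (?, ?, ?)",
    [("a", i, bytes(32768)) for i in range(100)],
)

# Real size calculation: touches the LENGTH of every chunk.
real_size, real_secs = timed(
    conn, "SELECT SUM(LENGTH(chunk)) FROM entries_data WHERE id = ?", ("a",)
)

# Hardcoded variant: each chunk counts as 1, so LENGTH() is never called.
fake_size, fake_secs = timed(
    conn, "SELECT SUM(1) FROM entries_data WHERE id = ?", ("a",)
)

print(f"SUM(LENGTH(chunk)): {real_secs:.6f}s, SUM(1): {fake_secs:.6f}s")
```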
I was able to reproduce the issue locally by uploading a 1.1 GB file. The page load time jumped to 9.3 seconds.
Solo developer. Lover of unit tests. Builder of TinyPilot. ex-Google, ex-Microsoft
Michael Lynch's personal Mastodon instance