Author of the blog here. I had a great time writing this. By far the most complex article I've ever put together, with literally thousands of lines of js to build out these interactive visuals. I hope everyone enjoys.
I've been advocating for SQLite+NVMe for a while now. To me it's a pattern you can apply to get much further into trouble than usual; in some cases you might actually make it out the other side without ever needing to scale horizontally.

Latency is king in all performance matters, *especially* where items must be processed serially. Running SQLite on NVMe gives you a latency advantage that no networked database provider can offer. I don't think running in memory is even a substantial uplift over NVMe persistence for most real-world use cases.
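To make the "processed serially" point concrete, here is a rough back-of-the-envelope sketch (my illustrative numbers, not the article's): a request that issues a chain of dependent queries pays the per-query latency once per query, so the gap between local NVMe and a network-attached volume compounds quickly.

```python
# Rough sketch: how per-query latency compounds when queries are dependent
# (must run one after another). Latency figures are illustrative assumptions,
# not measurements from the article.

QUERIES_PER_REQUEST = 20          # dependent queries issued serially
LATENCY_US = {
    "local NVMe (SQLite on instance storage)": 80,     # ~tens of microseconds
    "network-attached volume (EBS-style)": 1000,       # ~1 ms per round trip
}

for name, per_query_us in LATENCY_US.items():
    total_ms = QUERIES_PER_REQUEST * per_query_us / 1000
    print(f"{name:45s} -> {total_ms:6.2f} ms of pure storage wait per request")
```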
Seeing the disk IO animation reminded me of Melvin Kaye [0]:

    Mel never wrote time-delay loops, either, even when the balky Flexowriter
    required a delay between output characters to work right.
    He just located instructions on the drum
    so each successive one was just past the read head when it was needed;
    the drum had to execute another complete revolution to find the next instruction.
[0] https://pages.cs.wisc.edu/~markhill/cs354/Fall2008/notes/The_Story_of_Mel.html
Metal looks super cool; however, at my last job, when we tried using instance-local SSDs on GCP, there were serious reliability issues (e.g. blocks on the device losing data). Has this situation changed? What machine types are you using?

Our workaround was this: https://discord.com/blog/how-discord-supercharges-network-disks-for-extreme-low-latency
Nice blog. There is also the problem that cloud storage is generally "just unusually slow" (this has been noted by others before; here is a nice summary of the problem: http://databasearchitects.blogspot.com/2024/02/ssds-have-become-ridiculously-fast.html).

Having recently added support for storing our incremental indexes in https://github.com/feldera/feldera on S3/object storage (we've had NVMe support for longer, due to the obvious performance advantages mentioned in the article linked above), we'd be happy for someone to disrupt this space with a better offering ;).
I think there are a couple of things about distributed storage that aren't appreciated in this article:

1. Some systems do not support replication out of the box. Sure, your Cassandra cluster and MySQL can do master-slave replication, but lots of systems cannot.

2. Your life becomes much harder with NVMe storage in the cloud, because you need to respect maintenance intervals and cloud-initiated drains. If you do not hook into those systems and drain your data to a different node, the data goes poof. Separating storage from compute lets the cloud operator drain and move compute around as needed: since the data is independent of the compute, and the operator manages the storage system and its draining as well, workload placement can be handled without the customer needing to be involved.
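To illustrate what "hooking into those systems" can look like in practice, here is a rough sketch of a drain watcher for GCE (my example, not from the article): it polls the instance metadata server's maintenance-event key and kicks off a data drain when an event is scheduled. The endpoint and its values are from memory of the GCE docs, so verify them before relying on this.

```python
# Sketch: watch GCE's maintenance-event metadata key and trigger a drain.
# Endpoint path, header, and values are assumptions based on GCE docs;
# check current documentation before using.
import time
import urllib.request

METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/maintenance-event")

def maintenance_event() -> str:
    req = urllib.request.Request(METADATA_URL,
                                 headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode().strip()

def drain_to_peer() -> None:
    # Placeholder: stop accepting writes, copy/replicate local NVMe data
    # to another node, then signal that this node is safe to take down.
    print("draining local data to a peer node...")

def main() -> None:
    while True:
        event = maintenance_event()
        if event != "NONE":          # e.g. TERMINATE_ON_HOST_MAINTENANCE
            drain_to_peer()
            break
        time.sleep(10)

if __name__ == "__main__":
    main()
```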
This is really cool, and PlanetScale Metal looks really solid, too. Always a huge sucker for seeing huge latency drops on releases: https://planetscale.com/blog/upgrading-query-insights-to-metal
For years, I just didn't get why replicated databases always stick with EBS and deal with its latency. Replication is already there, so why not be brave and just go with local disks? At one of my previous orgs, where we ran Elasticsearch for temporary logs/metrics storage, I proposed we do exactly that, since we didn't even have major reliability requirements. I couldn't convince them back then, and we ended up with the even worse AWS Elasticsearch.

I get that local disks are finite, but I think the core/memory/disk ratio would be good enough for most use cases. There are plenty of local-disk instance types with different ratios as well, so a good balance could be found. You could even use the local-hard-disk ones with 20TB+ disks to implement hot/cold storage.

Big kudos to the PlanetScale team; they're finally doing what makes sense. Even AWS themselves don't run Elasticsearch on local disks! Imagine running ClickHouse, Cassandra, all of that on local disks.
Really, really great article. The visualization of random writes is very nicely done.

On:

> Another issue with network-attached storage in the cloud comes in the form of limiting IOPS. Many cloud providers that use this model, including AWS and Google Cloud, limit the amount of IO operations you can send over the wire. [...]

> If instead you have your storage attached directly to your compute instance, there are no artificial limits placed on IO operations. You can read and write as fast as the hardware will allow for.

I feel like this might be a dumb series of questions, but:

1. The rate limit on "IOPS" is precisely a rate limit on a particular kind of network traffic, right? Namely traffic to/from an EBS volume? "IOPS" really means "EBS volume network traffic"?

2. Does this save me money? And if yes, is it from some weird AWS arbitrage, or is it more an efficiency win from doing less EBS networking?

I can see pretty clearly that putting storage and compute on the same machine is strictly a latency win, because you structurally have one less hop every time. But is it also a throughput-per-dollar win?
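For a sense of scale on those caps, here is a rough comparison (the figures are approximate, from memory of public spec sheets rather than the article, so treat them as assumptions and check current AWS documentation):

```python
# Rough comparison of IOPS/throughput ceilings: EBS volume types vs. a local
# NVMe instance-store drive. Figures are approximate assumptions from memory
# of public spec sheets, not from the article.

ceilings = {
    "EBS gp3 (max provisioned)":   {"iops": 16_000,    "throughput_mb_s": 1_000},
    "EBS io2 Block Express (max)": {"iops": 256_000,   "throughput_mb_s": 4_000},
    "Local NVMe (typical PCIe 4)": {"iops": 1_000_000, "throughput_mb_s": 7_000},
}

for name, c in ceilings.items():
    # Throughput achievable from 4 KiB random I/O alone, vs. the stated cap.
    random_4k_mb_s = c["iops"] * 4 / 1024
    print(f"{name:30s} {c['iops']:>9,} IOPS  "
          f"(~{random_4k_mb_s:,.0f} MiB/s at 4 KiB)  cap {c['throughput_mb_s']:,} MB/s")
```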
If this is true, then how do "serverless" database providers like Neon advertise "low latency" access? They use object storage like S3, which I imagine is an order of magnitude worse than networked storage for latency.

edit: apparently they build a kafkaesque layer of caching. No thank you, I'll just keep my data on locally attached NVMe.
Great nerdbaiting ad. I read all the way to the bottom of it, and bookmarked it to send to my kids if I feel they are not understanding storage architectures properly. :)
I love the visuals, and if it's OK with you I'll probably link to them from my class material on block devices in a week or so.

One small nit:
> A typical random read can be performed in 1-3 milliseconds.

Um, no. A 7200 RPM platter completes a rotation in 8.33 milliseconds, so the rotational delay for a random read is uniformly distributed between 0 and 8.33 ms, i.e. a mean of 4.17 ms, before you even add seek time.

> a single disk will often have well over 100,000 tracks

By my calculations, a Seagate IronWolf 18TB has about 615K tracks per surface, given that it has 9 platters and 18 surfaces and an outer-diameter read speed of about 260 MB/s (or 557K tracks per inch, given typical inner and outer track diameters).

For more than you ever wanted to know about hard drive performance and the mechanical/geometrical considerations that go into it, see https://www.msstconference.org/MSST-history/2024/Papers/msst24-1.1.pdf
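Here is a rough sketch of how numbers like those fall out. The platter geometry is my assumption (inner recording diameter roughly half the outer); everything else follows from the figures in the comment above, so the result is a ballpark rather than an exact reproduction.

```python
# Back-of-the-envelope reproduction of the rotational-latency and track-count
# estimates above. The recording-band geometry (inner diameter ~half the outer)
# is my assumption; exact figures depend on the real platter layout.

RPM = 7200
rev_ms = 60_000 / RPM                 # 8.33 ms per revolution
mean_rot_ms = rev_ms / 2              # ~4.17 ms average rotational delay

capacity_bytes = 18e12                # 18 TB drive
surfaces = 18                         # 9 platters, 2 surfaces each
od_read_mb_s = 260                    # sequential read speed at the outer edge

outer_track_bytes = od_read_mb_s * 1e6 / (RPM / 60)   # data under the head per rev
avg_track_bytes = outer_track_bytes * 0.75            # inner diameter ~ outer/2
tracks_per_surface = (capacity_bytes / surfaces) / avg_track_bytes

print(f"revolution: {rev_ms:.2f} ms, mean rotational latency: {mean_rot_ms:.2f} ms")
print(f"tracks per surface: ~{tracks_per_surface/1e3:.0f}K")
```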
Disk latency, and one's aversion to it, is IMHO the only way Hetzner costs can run up on you. You want to keep the database on local disk, not on their very slow attached Volumes (Hetzner's EBS equivalent). In short, you can end up with relatively light workloads on fairly expensive VMs because you need 500GB or more of local disk. The biggest VM they offer in the US has 1TB of local disk, at 300 EUR a month.
I'm always curious about latency for all these new DB offerings like PlanetScale/Neon/Supabase.

It seems like they don't emphasise strongly enough: _make sure you colocate your server in the same cloud/AZ/region/DC as our DB_. I suspect a large fraction of their users don't realise this and have loads of server-DB traffic happening very slowly over the public internet. It won't take many slow DB reads (get session, get a thing, get one more) to trash your server's response latency.
Nice article, but the replicated approach isn't exactly comparing like with like. To achieve the same semantics you'd need to block for a response from the remote backup servers, which would end up with the same latency as the other cloud providers...
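A tiny model of that trade-off, with illustrative numbers of my own (none of these figures come from the article): once the commit has to wait for a remote acknowledgement, the network round trip dominates, just as it does with network-attached storage.

```python
# Sketch: commit latency under async vs. synchronous (remote-ack) replication.
# All latency figures are illustrative assumptions.

local_fsync_us = 80        # local NVMe write + fsync
replica_rtt_us = 500       # network round trip to a replica in another AZ
replica_fsync_us = 80      # replica's own local write

async_commit_us = local_fsync_us                                       # ack after local write only
sync_commit_us = max(local_fsync_us, replica_rtt_us + replica_fsync_us)  # wait for remote ack

print(f"async replication commit: ~{async_commit_us} us")
print(f"sync  replication commit: ~{sync_commit_us} us (comparable to network-attached storage)")
```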
> The next major breakthrough in storage technology was the hard disk drive.

There were a few storage methods in between tape and HDDs, notably core memory and magnetic drum memory.
Hrm "unlimited IOPS"? I suppose contrasted against the abysmal IOPS available to Cloud block devs. A good modern NVMe enterprise drive is specced for (order of magnitude) 10^6 to 10^7 IOPS. If you can saturate that from database code, then you've got some interesting problems, but it's definitely not unlimited.
We are working on a platform that lets you measure this stuff with pretty high precision in real time.

You can check out our sandbox here: https://yeet.cx/play