Author of the blog here. I had a great time writing this. By far the most complex article I've ever put together, with literally thousands of lines of js to build out these interactive visuals. I hope everyone enjoys.
I've been advocating for SQLite+NVMe for a while now. To me it's a pattern you can apply to get much further into trouble than usual; in some cases you might actually make it out the other side without ever needing to scale horizontally.

Latency is king in all performance matters, *especially* where items must be processed serially. Running SQLite on NVMe gives you a latency advantage that no networked database provider can offer. I don't think running in memory is even a substantial uplift over NVMe persistence for most real-world use cases.
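To make the "processed serially" point concrete, here is a rough back-of-the-envelope sketch (my illustrative numbers, not the article's): a request that issues a chain of dependent queries pays the per-query latency once per query, so the gap between local NVMe and a network-attached volume compounds quickly.

```python
# Rough sketch: how per-query latency compounds when queries are dependent
# (must run one after another). Latency figures are illustrative assumptions,
# not measurements from the article.

QUERIES_PER_REQUEST = 20          # dependent queries issued serially
LATENCY_US = {
    "local NVMe (SQLite on instance storage)": 80,     # ~tens of microseconds
    "network-attached volume (EBS-style)": 1000,       # ~1 ms per round trip
}

for name, per_query_us in LATENCY_US.items():
    total_ms = QUERIES_PER_REQUEST * per_query_us / 1000
    print(f"{name:45s} -> {total_ms:6.2f} ms of pure storage wait per request")
```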
Seeing the disk IO animation reminded me of Melvin Kaye [0]:

    Mel never wrote time-delay loops, either, even when the balky Flexowriter
    required a delay between output characters to work right.
    He just located instructions on the drum
    so each successive one was just past the read head when it was needed;
    the drum had to execute another complete revolution to find the next instruction.
[0] https://pages.cs.wisc.edu/~markhill/cs354/Fall2008/notes/The_Story_of_Mel.html
Metal looks super cool; however, at my last job, when we tried using instance-local SSDs on GCP, there were serious reliability issues (e.g. blocks on the device losing data). Has this situation changed? What machine types are you using?

Our workaround was this: https://discord.com/blog/how-discord-supercharges-network-disks-for-extreme-low-latency
Nice blog. There is also the problem that cloud storage is generally "just unusually slow" (this has been noted by others before; here is a nice summary of the problem: http://databasearchitects.blogspot.com/2024/02/ssds-have-become-ridiculously-fast.html).

Having recently added support for storing our incremental indexes in https://github.com/feldera/feldera on S3/object storage (we've had NVMe support for longer, due to the obvious performance advantages mentioned in the article linked above), we'd be happy for someone to disrupt this space with a better offering ;).
I think there are a couple of things about distributed storage that aren't appreciated in this article:

1. Some systems do not support replication out of the box. Sure, your Cassandra cluster and MySQL can do master-slave replication, but lots of systems cannot.

2. Your life becomes much harder with NVMe storage in the cloud, because you need to respect maintenance intervals and cloud-initiated drains. If you do not hook into those systems and drain your data to a different node, the data goes poof. Separating storage from compute lets the cloud operator drain and move compute around as needed: since the data is independent of the compute, and the operator manages the storage system and its draining as well, workload placement can be handled without the customer needing to be involved.
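To illustrate what "hooking into those systems" can look like in practice, here is a rough sketch of a drain watcher for GCE (my example, not from the article): it polls the instance metadata server's maintenance-event key and kicks off a data drain when an event is scheduled. The endpoint and its values are from memory of the GCE docs, so verify them before relying on this.

```python
# Sketch: watch GCE's maintenance-event metadata key and trigger a drain.
# Endpoint path, header, and values are assumptions based on GCE docs;
# check current documentation before using.
import time
import urllib.request

METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/maintenance-event")

def maintenance_event() -> str:
    req = urllib.request.Request(METADATA_URL,
                                 headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode().strip()

def drain_to_peer() -> None:
    # Placeholder: stop accepting writes, copy/replicate local NVMe data
    # to another node, then signal that this node is safe to take down.
    print("draining local data to a peer node...")

def main() -> None:
    while True:
        event = maintenance_event()
        if event != "NONE":          # e.g. TERMINATE_ON_HOST_MAINTENANCE
            drain_to_peer()
            break
        time.sleep(10)

if __name__ == "__main__":
    main()
```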
This is really cool, and PlanetScale Metal looks really solid, too. Always a huge sucker for seeing huge latency drops on releases: https://planetscale.com/blog/upgrading-query-insights-to-metal
For years, I just didn't get why replicated databases always stick with EBS and deal with its latency. Replication is already there, so why not be brave and just go with local disks? At one of my previous orgs, where we ran Elasticsearch for temporary logs/metrics storage, I proposed we do exactly that, since we didn't even have major reliability requirements. I couldn't convince them back then, and we ended up with the even worse AWS Elasticsearch.

I get that local disks are finite, but I think the core/memory/disk ratio would be good enough for most use cases. There are plenty of local-disk instance types with different ratios as well, so a good balance could be found. You could even use the local-hard-disk ones with 20TB+ disks to implement hot/cold storage.

Big kudos to the PlanetScale team; they're finally doing what makes sense. Even AWS themselves don't run Elasticsearch on local disks! Imagine running ClickHouse, Cassandra, all of that on local disks.
Really, really great article. The visualization of random writes is very nicely done.

On:

> Another issue with network-attached storage in the cloud comes in the form of limiting IOPS. Many cloud providers that use this model, including AWS and Google Cloud, limit the amount of IO operations you can send over the wire. [...]

> If instead you have your storage attached directly to your compute instance, there are no artificial limits placed on IO operations. You can read and write as fast as the hardware will allow for.

I feel like this might be a dumb series of questions, but:

1. The rate limit on "IOPS" is precisely a rate limit on a particular kind of network traffic, right? Namely traffic to/from an EBS volume? "IOPS" really means "EBS volume network traffic"?

2. Does this save me money? And if yes, is it from some weird AWS arbitrage, or is it more an efficiency win from doing less EBS networking?

I can see pretty clearly that putting storage and compute on the same machine is strictly a latency win, because you structurally have one less hop every time. But is it also a throughput-per-dollar win?
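For a sense of scale on those caps, here is a rough comparison (the figures are approximate, from memory of public spec sheets rather than the article, so treat them as assumptions and check current AWS documentation):

```python
# Rough comparison of IOPS/throughput ceilings: EBS volume types vs. a local
# NVMe instance-store drive. Figures are approximate assumptions from memory
# of public spec sheets, not from the article.

ceilings = {
    "EBS gp3 (max provisioned)":   {"iops": 16_000,    "throughput_mb_s": 1_000},
    "EBS io2 Block Express (max)": {"iops": 256_000,   "throughput_mb_s": 4_000},
    "Local NVMe (typical PCIe 4)": {"iops": 1_000_000, "throughput_mb_s": 7_000},
}

for name, c in ceilings.items():
    # Throughput achievable from 4 KiB random I/O alone, vs. the stated cap.
    random_4k_mb_s = c["iops"] * 4 / 1024
    print(f"{name:30s} {c['iops']:>9,} IOPS  "
          f"(~{random_4k_mb_s:,.0f} MiB/s at 4 KiB)  cap {c['throughput_mb_s']:,} MB/s")
```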
If this is true, then how do "serverless" database providers like Neon advertise "low latency" access? They use object storage like S3, which I imagine is an order of magnitude worse than networked storage for latency.

edit: apparently they build a kafkaesque layer of caching. No thank you, I'll just keep my data on locally attached NVMe.
Great nerdbaiting ad. I read all the way to the bottom of it, and bookmarked it to send to my kids if I feel they are not understanding storage architectures properly. :)
I love the visuals, and if it's OK with you I'll probably link to them from my class material on block devices in a week or so.

One small nit:
> A typical random read can be performed in 1-3 milliseconds.

Um, no. A 7200 RPM platter completes a rotation in 8.33 milliseconds, so the rotational delay for a random read is uniformly distributed between 0 and 8.33 ms, i.e. a mean of 4.17 ms, before you even add seek time.

> a single disk will often have well over 100,000 tracks

By my calculations, a Seagate IronWolf 18TB has about 615K tracks per surface, given that it has 9 platters and 18 surfaces and an outer-diameter read speed of about 260 MB/s (or 557K tracks per inch, given typical inner and outer track diameters).

For more than you ever wanted to know about hard drive performance and the mechanical/geometrical considerations that go into it, see https://www.msstconference.org/MSST-history/2024/Papers/msst24-1.1.pdf
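Here is a rough sketch of how numbers like those fall out. The platter geometry is my assumption (inner recording diameter roughly half the outer); everything else follows from the figures in the comment above, so the result is a ballpark rather than an exact reproduction.

```python
# Back-of-the-envelope reproduction of the rotational-latency and track-count
# estimates above. The recording-band geometry (inner diameter ~half the outer)
# is my assumption; exact figures depend on the real platter layout.

RPM = 7200
rev_ms = 60_000 / RPM                 # 8.33 ms per revolution
mean_rot_ms = rev_ms / 2              # ~4.17 ms average rotational delay

capacity_bytes = 18e12                # 18 TB drive
surfaces = 18                         # 9 platters, 2 surfaces each
od_read_mb_s = 260                    # sequential read speed at the outer edge

outer_track_bytes = od_read_mb_s * 1e6 / (RPM / 60)   # data under the head per rev
avg_track_bytes = outer_track_bytes * 0.75            # inner diameter ~ outer/2
tracks_per_surface = (capacity_bytes / surfaces) / avg_track_bytes

print(f"revolution: {rev_ms:.2f} ms, mean rotational latency: {mean_rot_ms:.2f} ms")
print(f"tracks per surface: ~{tracks_per_surface/1e3:.0f}K")
```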
Disk latency, and one's aversion to it, is IMHO the only way Hetzner costs can run up on you. You want to keep the database on local disk, not on their very slow attached Volumes (Hetzner's EBS equivalent). In short, you can end up with relatively light workloads on fairly expensive VMs because you need 500GB or more of local disk. The biggest VM they offer in the US has 1TB of local disk, at 300 EUR a month.
I'm always curious about latency for all these new DB offerings like PlanetScale/Neon/Supabase.

It seems like they don't emphasise strongly enough: _make sure you colocate your server in the same cloud/AZ/region/DC as our DB_. I suspect a large fraction of their users don't realise this and have loads of server-DB traffic happening very slowly over the public internet. It won't take many slow DB reads (get session, get a thing, get one more) to trash your server's response latency.
Nice article, but the replicated approach isn't exactly comparing like with like. To achieve the same semantics you'd need to block for a response from the remote backup servers, which would end up with the same latency as the other cloud providers...
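A tiny model of that trade-off, with illustrative numbers of my own (none of these figures come from the article): once the commit has to wait for a remote acknowledgement, the network round trip dominates, just as it does with network-attached storage.

```python
# Sketch: commit latency under async vs. synchronous (remote-ack) replication.
# All latency figures are illustrative assumptions.

local_fsync_us = 80        # local NVMe write + fsync
replica_rtt_us = 500       # network round trip to a replica in another AZ
replica_fsync_us = 80      # replica's own local write

async_commit_us = local_fsync_us                                       # ack after local write only
sync_commit_us = max(local_fsync_us, replica_rtt_us + replica_fsync_us)  # wait for remote ack

print(f"async replication commit: ~{async_commit_us} us")
print(f"sync  replication commit: ~{sync_commit_us} us (comparable to network-attached storage)")
```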
> The next major breakthrough in storage technology was the hard disk drive.

There were a few storage methods in between tape and HDDs, notably core memory and magnetic drum memory.
Hrm "unlimited IOPS"? I suppose contrasted against the abysmal IOPS available to Cloud block devs. A good modern NVMe enterprise drive is specced for (order of magnitude) 10^6 to 10^7 IOPS. If you can saturate that from database code, then you've got some interesting problems, but it's definitely not unlimited.
We are working on a platform that lets you measure this stuff with pretty high precision in real time.

You can check out our sandbox here: https://yeet.cx/play