I liked most of the piece, but some bits rubbed me the wrong way:<p>> I was taken by surprise by the fact that although every one of my peers is certainly extremely bright, most of them carried misconceptions about how to best exploit the performance of modern storage technology leading to suboptimal designs, even if they were aware of the increasing improvements in storage technology.<p>> In the process of writing this piece I had the immense pleasure of getting early access to one of the next generation Optane devices, from Intel.<p>The entire blog post is complaining about how great engineers have misconceptions about modern storage technology, and yet to prove it the author had to obtain benchmarks from <i>early</i> access to <i>next-generation</i> devices...?! And to top it off, from this we conclude "the disconnect" is due to the <i>APIs</i>? Not, say, from the possibility that such blazing-fast components may very well not even <i>exist</i> in users' devices? I'm not saying the conclusions are wrong, but the logic surely doesn't follow... and honestly it's a little tasteless to criticize people's understanding if you're going to base the criticism on things they in all likelihood don't even have access to.
From the author's previous piece: <a href="https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/" rel="nofollow">https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-wi...</a><p>> Our CTO, Avi Kivity, made the case for async at the Core C++ 2019 event. The bottom line is this; in modern multicore, multi-CPU devices, the CPU itself is now basically a network, the intercommunication between all the CPUs is another network, and calls to disk I/O are effectively another. There are good reasons why network programming is done asynchronously, and you should consider that for your own application development too.
>
> It fundamentally changes the way Linux applications are to be designed: Instead of a flow of code that issues syscalls when needed, that have to think about whether or not a file is ready, they naturally become an event-loop that constantly add things to a shared buffer, deals with the previous entries that completed, rinse, repeat.<p>As someone who's been working on FRP-related things for a while now, this feels very vindicating. :)<p>I feel like as recently as a few years ago, the systems world was content with its incremental hacks, but now the gap between the traditional interfaces and hardware realities has become too great, and a bigger redesign is afoot.<p>Excited for what emerges!
One thing I have started to realize is that best case latency of an NVMe storage device is starting to overlap with areas where SpinWait could be more ideal than an async/await API. I am mostly advocating for this from a mass parallel throughput perspective, especially if batching is possible.<p>I have started to play around with using LMAX Disruptor for aggregating a program's disk I/O requests and executing them in batches. This is getting into levels of throughput that are incompatible with something like what the Task abstractions in .NET enable. The public API of such an approach is synchronous as a result of this design constraint.<p>Software should always try to work with the physical hardware capabilities. Modern SSDs are most ideally suited to arrangements where all data is contained in an append-only log with each batch written to disk representing a consistent snapshot. If you are able to batch thousands of requests into a single byte array of serialized modified nodes, you can append this onto disk so much faster than if you force the SSD to make individual writes per new/modified entity.
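The batching idea above can be sketched without any Disruptor machinery. This is a minimal illustration (names like `append_batch` are my own, not from any library the commenter mentions): serialize a whole batch of records into one buffer and issue a single append, so the SSD sees one large sequential write per batch instead of thousands of tiny ones.

```python
import os

def append_batch(path, entries):
    """Serialize a batch of records into one buffer and append it with a
    single write call, rather than one write per record."""
    buf = b"".join(len(e).to_bytes(4, "little") + e for e in entries)
    with open(path, "ab") as f:
        f.write(buf)           # one contiguous append per batch
        f.flush()
        os.fsync(f.fileno())   # the batch is the durability/consistency unit

def read_log(path):
    """Replay the append-only log back into a list of records."""
    records = []
    with open(path, "rb") as f:
        while hdr := f.read(4):
            records.append(f.read(int.from_bytes(hdr, "little")))
    return records
```

Each `append_batch` call is a consistent snapshot boundary: either the whole batch is fsync'd or none of it is, which is exactly the append-only-log arrangement the comment describes.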
Prior discussion from /r/rust where the author is present to answer questions: <a href="https://www.reddit.com/r/rust/comments/k16j6x/modern_storage_is_plenty_fast_it_is_the_apis_that/" rel="nofollow">https://www.reddit.com/r/rust/comments/k16j6x/modern_storage...</a>
I agree with the premise, but disagree with the conclusion.<p>For a little background, my first computer was a Mac Plus around 1985, and I remember doing file copy tests on my first hard drive (an 80 MB) at over 1 MB/sec. If I remember correctly, SCSI could do 5 MB/sec copies clear back in the mid-80s. So until we got SSD, hard drive speed stayed within the same order of magnitude for like 30 years (as most of you remember):<p><a href="http://chrislawson.net/writing/macdaniel/2k1120cl.shtml" rel="nofollow">http://chrislawson.net/writing/macdaniel/2k1120cl.shtml</a><p>So the time to take our predictable deterministic synchronous blocking business logic into the maze of asynchronous promise spaghetti was a generation ago when hard drive speeds were two orders of magnitude slower than today.<p>In other words, fix the bad APIs. Please don't make us shift paradigms.<p>Now if we want to talk about some kind of compiled or graph-oriented way of processing large numbers of files performantly with some kind of async processing internally, then that's fine. Note that this solution will mirror whatever we come up with for network processing as well. That was the whole point of UNIX in the first place, to treat file access and network access as the same stream-oriented protocol. Which I think is the motive behind taking file access into the same problematic async domain that web development is having to deal with now.<p>But really we should get the web back to the proven UNIX/Actor model way of doing things with synchronous blocking I/O.
I suppose _modern_ storage is fast, but how many servers are running on storage this modern? None of mine are and my work dev machine is still rocking a SATA 2.5" SSD.<p>We're probably still a few years off from being able to switch to this fast I/O yet. With the new game consoles switching over to PCIe SSDs I expect the price of NVMe drives to drop over the next few years until they're cheap enough that the majority of computers are running NVMe drives.<p>Even with SATA drives like mine though, there's really not that much performance loss from doing IO operations. I've run my OS with 8GiB of SSD swap in active use during debugging and while the stutters are annoying and distracting, the computer didn't grind to a halt like it would with spinning rust. Storage speed has increased massively in the last five years, for the love of god fellow developers, please make use of it when you can!<p>That said, deferring IO until you're done still makes sense for some consumer applications because cheap laptops are still being sold with hard drives and those devices are probably the minimum requirement you'll be serving.
Intuitively one should be able to approach the max speed for sequential reads via some tuning (queue/read_ahead_kb) even with the traditional, blocking posix interface. This would require a large enough read-ahead and large enough buffer size. Not poisoning the page cache/manually managing the page cache is an orthogonal issue and only relevant for some applications (and the additional memory copy barely makes a difference in the OP's post).<p>One advantage of using high level (Linux) kernel interfaces is that this "automatically" gets faster with newer Linux versions without a need of large application level changes. Maybe in a few years we'll have an extra cache layer, or it stores to persistent memory now. Linux will (slowly) improve and your application with it. This won't happen if it is specifically tuned for Direct I/O with Intel Optane in 2020.<p>But yeah, random IO is (currently) another issue, and as said, the usual advice is to avoid it. And with the old API this still holds. If one currently wants fast random IO one needs to use io_uring/aio (with Direct-IO) or just live with the performance not being optimal and hope that the page cache does more good than bad (like PostgreSQL).
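Besides tuning `read_ahead_kb` from sysfs, an application can also hint the kernel itself. A rough sketch of the blocking-POSIX approach described above, using `posix_fadvise` (Linux-only, so it's guarded; the function name `read_sequential` is mine):

```python
import os

def read_sequential(path, chunk_size=1 << 20):
    """Plain blocking reads, but tell the kernel up front that access will
    be sequential so it can schedule aggressive read-ahead."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):  # available on Linux/POSIX
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        data = bytearray()
        # large buffers keep the request queue fed between syscalls
        while chunk := os.read(fd, chunk_size):
            data += chunk
        return bytes(data)
    finally:
        os.close(fd)
```

The point matches the comment: nothing here is tuned to a particular device, so as kernel read-ahead heuristics improve, this code gets faster for free.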
I found this to be a good read, but I wish the author discussed the pros/cons of bypassing the file system and using a block device with direct I/O. I've found that with Optane drives the performance is high enough that the extra load from the file system (in terms of CPU) is significant. If the author was using a file system (which I assume is the case) which was it?
> ...misconceptions... Yet if you skim through specs of modern NVMe devices you see commodity devices with latencies in the microseconds range and several GB/s of throughput supporting several hundred thousands random IOPS. So where’s the disconnect?<p>Whoa there... let's not compare devices with 20+ GB/s and latencies in nanosecond ranges which translate to half a dozen giga-ops per second (aka RAM) with any kind of flash-based storage just yet.
This is such a big deal! The assumptions made when IO APIs were designed are so out-of-step with today's hardware that it really is time to have a big rethink. In graphics, the last 20 years of API development have very much been focused on harnessing a GPU that has again and again outgrown the CPU's ability to feed it. So much has been learned, and we really need to apply this to both storage and networking.
Nice article. This part puzzled me though:<p>> The operating system reads data in page granularity, meaning it can only read at a minimum 4kB at a time. That means if you need to read read 1kB split in two files, 512 bytes each, you are effectively reading 8kB to serve 1kB, wasting 87% of the data read.<p>SSDs (whether SATA or NVMe) all read and write whole sectors at a time, right? I'm not sure what the sector size is, but 4 KiB seems like a reasonable guess. So I think you're reading the 8 KiB no matter what; it may just be a question of what layer you drop it at (right when it gets to the kernel or not). Also, doesn't direct IO require sector size-aligned operations?
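The quoted 87% figure does check out as arithmetic once you round every request up to page granularity. A quick sketch (the helper `read_waste` is my own name for the calculation, not anything from the article):

```python
PAGE = 4096  # typical page size; common SSD sector sizes are 512 B or 4 KiB

def read_waste(request_sizes, granularity=PAGE):
    """Bytes requested vs bytes actually transferred when every request is
    rounded up to whole pages/sectors, plus the wasted fraction."""
    requested = sum(request_sizes)
    # each request touches ceil(size / granularity) full pages
    transferred = sum(-(-s // granularity) * granularity
                      for s in request_sizes)
    return requested, transferred, 1 - requested / transferred

# Two 512-byte reads from two different files: 8 KiB moved for 1 KiB used.
req, moved, waste = read_waste([512, 512])  # waste = 0.875, i.e. ~87%
```

And as the comment points out, the same arithmetic applies one layer down if the drive's internal read unit is 4 KiB; the question is only which layer discards the unused bytes.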
This is why I am excited about DirectStorage, hoping it can usher a new era of IO APIs. We’ll see.<p><a href="https://devblogs.microsoft.com/directx/directstorage-is-coming-to-pc/" rel="nofollow">https://devblogs.microsoft.com/directx/directstorage-is-comi...</a>
This is a really poor article. Only in very rare circumstances can developers change the APIs. APIs are not "bad"; they are built to various important requirements. Only some of those requirements have to do with performance.<p>> <i>“Well, it is fine to copy memory here and perform this expensive computation because it saves us one I/O operation, which is even more expensive”.</i><p>"I/O operation" in fact refers to the API call, not to the raw hardware operation. If the developer measured this and found it true, how can it be a misconception?
It may be caused by a "bad" I/O API, but so what? The API is what it is.<p>APIs provide one requirement which is stability: keeping applications working. That is king. You can't throw out APIs every two years due to hardware advancements.<p>> <i>“If we split this into multiple files it will be slow because it will generate random I/O patterns. We need to optimize this for sequential access and read from a single file”</i><p>Though solid state storage doesn't have track-to-track seek times, the sequential-access-fast rule of thumb has not become false.<p>Random access may have to wastefully read larger blocks of the data than are actually requested by the application. The unused data gets cached, but if it's not going to be accessed any time soon, it means that something else got wastefully bumped out of the cache. Sequential access is likely to make use of an entire block.<p>Secondly, there is that API again. The underlying operating system may provide a read-ahead mechanism which reduces its own overheads, benefiting the application which structures its data for sequential access, even if there is no inherent hardware-level benefit.<p>If there is any latency at all between the application and the hardware, and if you can guess what the application is going to read next, that's an opportunity to improve performance. You can correctly guess what the application will read if you guess that it is doing a sequential read, and the application makes that come true.
John Ousterhout, in the RAMCloud project, bet that 2.5μs latency for a read across the network would be possible. NVMe devices seem to be inching towards that too - like < 20μs at this point. Would be interesting to see how these numbers play out.
AWS EBS gp2 is far from an NVMe drive.<p>Everyone architects for AWS these days. So the limitations of EBS still dominate I/O: limited IOPS, limited bandwidth.<p>Yes, there are ephemerals, which in the actual AWS design are basically little more than ramdisks/caches.
Because I am a "glue" programmer, and I realize that all storage options suck, I've decided to wait on any infrastructure choices for now, and just use the filesystem as a key-value store when developing my projects.<p>When I need indexing, I use SQLite, but I limit myself to a very basic subset of SQL that would work in any of Oracle, Maria, Microsoft stores without changes.
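The filesystem-as-KV-store approach above can be as small as this sketch (the class `FsKV` and its layout are my own assumptions, not the commenter's actual code): one file per key, with an atomic rename so a crash never leaves a half-written value.

```python
import os

class FsKV:
    """Tiny key-value store backed by the filesystem: one file per key."""
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # hex-encode the key so arbitrary strings map to safe filenames
        return os.path.join(self.root, key.encode().hex())

    def put(self, key, value):
        tmp = self._path(key) + ".tmp"
        with open(tmp, "wb") as f:
            f.write(value)
        os.replace(tmp, self._path(key))  # atomic rename = crash-safe update

    def get(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None
```

The nice property for a "wait and see" strategy is that the interface is two methods, so swapping in SQLite, RocksDB, or anything else later is a small change.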
What I hate about Unix filesystems: the fact that you can't take a drive, put it in another computer and have permissions (user/group-ids) working instantly. Same for sharing over nfs.<p>Of course, people have tried to solve this, but I think not well enough. It's a huge amount of technical debt right there in the systems we use every day.
I think this is a loaded article and underscores the importance of low level engineers who understand your workload to guide purchasing. No longer will fringe benefits and bribes be enough.
I stopped reading this when it became evident that the author was just promoting a library he’d written. I might have been a tiny bit more interested if I were a Rust developer.
Also: Modern storage is plenty fast, but also not reliable for long term use.<p>That is why I buy a new SSD every year and clone my current (worn out) SSD to the new one. I have several old SSDs that started to get unhealthy, well, according to my S.M.A.R.T utility that I used to check them. I could probably get away with using an SSD for another year, but will not risk the data loss. Anyone else do this?