
Faster than the filesystem (2021)

216 points by madmax108 over 2 years ago

20 comments

ChuckMcM over 2 years ago

That a database is faster than a nominal file system has been known for quite a while. It can't quite replace them, though. Microsoft tried really hard to have a database as their root filesystem, investing a lot of time and effort, but ultimately it fizzled. Why? Important edge cases (like swap) that databases do really poorly.

That said, if you're using a file system abstraction for complex and compound documents, using a database is a really stellar way to go. In part because it doesn't have the "chunking" problem, where allocation blocks are used for both data and metadata in file systems, so you pick a size that is least bad for both; instead you get one "optimum" size for disk I/Os to keep the disk channel bandwidth highly utilized, with the naming/chunking part stored as records inside that.

I wrote a YAML <=> SQLite tool so that apps could use an efficient API to get at component types but the file could be "exported" to pure text. And it worked well in non-UTF string applications. (This was fine for the app, which was orchestrating back end processes.) At some point it would be interesting to move it to UTF-8 to see how that worked out.
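This is not ChuckMcM's actual tool, just a minimal sketch of the round-trip idea under stated assumptions: apps talk to a SQLite file through SQL, but the same data can be exported to, and re-imported from, plain-text YAML. It assumes PyYAML and a hypothetical table `components(name TEXT PRIMARY KEY, config TEXT)`.

```python
import sqlite3
import yaml  # pip install pyyaml


def export_to_yaml(db_path, yaml_path):
    """Dump the components table to a human-readable, diffable text file."""
    db = sqlite3.connect(db_path)
    rows = db.execute("SELECT name, config FROM components").fetchall()
    with open(yaml_path, "w", encoding="utf-8") as f:
        yaml.safe_dump({name: config for name, config in rows}, f)


def import_from_yaml(yaml_path, db_path):
    """Rebuild (or update) the SQLite side from the exported YAML."""
    with open(yaml_path, encoding="utf-8") as f:
        data = yaml.safe_load(f) or {}
    db = sqlite3.connect(db_path)
    with db:  # one transaction for the whole import
        db.execute(
            "CREATE TABLE IF NOT EXISTS components (name TEXT PRIMARY KEY, config TEXT)"
        )
        db.executemany(
            "INSERT OR REPLACE INTO components (name, config) VALUES (?, ?)",
            data.items(),
        )
```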
polyrand over 2 years ago

As others have pointed out, that benchmark is from 2017. However, the "SQLite: Past, Present, and Future" paper[0] has an updated version of this benchmark (see section 4.3, Blob manipulation), and also compares it with DuckDB.

Edit:

Another thing that is sometimes forgotten when comparing SQLite to the filesystem is that files are hard[1]. It's not only about performance, but also about all the guarantees that you get "for free", and all the complexity you can remove from your codebase if you need ACID interactions with your files.

[0]: https://www.vldb.org/pvldb/vol15/p3535-gaffney.pdf

[1]: https://danluu.com/file-consistency/
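A minimal sketch (mine, not from the paper) of the "guarantees for free" point: replacing several blobs is one transaction in SQLite, whereas with individual files you would need a temp-file + fsync + rename dance per file and still could not update the group atomically.

```python
import sqlite3

db = sqlite3.connect("blobs.db")
db.execute("CREATE TABLE IF NOT EXISTS blobs (key TEXT PRIMARY KEY, data BLOB)")


def replace_blobs(items):
    """items: dict of key -> bytes. Either every blob changes or none do."""
    with db:  # implicit BEGIN ... COMMIT (rolls back on any exception)
        db.executemany(
            "INSERT OR REPLACE INTO blobs (key, data) VALUES (?, ?)",
            [(k, sqlite3.Binary(v)) for k, v in items.items()],
        )


replace_blobs({"thumb/1.jpg": b"...", "thumb/2.jpg": b"..."})
```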
summerlight over 2 years ago

This reminds me of WinFS, which was probably one of the most ambitious architectural projects in the history of Windows, yet failed to materialize. The vision was very attractive: encode all the semantic knowledge of file schema and metadata into a relational filesystem, so you can programmatically query whatever information you want about the filesystem and its content.

IIRC the problem was its performance. I don't have any insider knowledge, so I can't pinpoint the culprit, but I suspect the performance issue was probably not a fundamental tradeoff (as this article suggests) but more a matter of an immature implementation. Storage technology has gotten much better nowadays, so many of its problems could be tackled differently. Of course, the question it has to answer is also different: is it still a problem worth solving?
mkl over 2 years ago

On Windows, in my experience, it's at least a factor of 10. I worked on a script recently that reads ~20000 files of a few KB each and extracts some info to generate a web page. I sped it up enormously just by putting the file contents into SQLite tables.
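Roughly how I understand mkl's approach, sketched from scratch (the paths and table name here are made up): slurp the small files into one SQLite table once, inside a single transaction, and let the page generator read from the already-open database instead of opening thousands of files.

```python
import pathlib
import sqlite3

db = sqlite3.connect("cache.db")
db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, content BLOB)")

# One transaction for the whole import; per-file opens happen only this once.
with db:
    for p in pathlib.Path("data").rglob("*.txt"):
        db.execute(
            "INSERT OR REPLACE INTO files (path, content) VALUES (?, ?)",
            (str(p), p.read_bytes()),
        )

# Later reads touch only the single open database handle.
for path, content in db.execute("SELECT path, content FROM files"):
    pass  # extract info and generate the web page here
```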
karmakaze over 2 years ago

The summary says:

> SQLite reads and writes small blobs (for example, thumbnail images) 35% faster¹ than the same blobs can be read from or written to individual files on disk using fread() or fwrite().

> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database.

This should come as a surprise to no one. The rest of the article is only of interest in terms of how benchmarking is done, if that's your thing.
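A rough micro-benchmark sketch of the open()/close()-per-blob overhead the quoted summary describes; this is not the article's benchmark code, and it assumes the files and a matching `blobs` table were populated beforehand with the same data.

```python
import sqlite3
import time

N = 10_000

# One open() + close() per blob when each blob is its own file.
t0 = time.perf_counter()
for i in range(N):
    with open(f"blobs/{i}.bin", "rb") as f:
        f.read()
t_files = time.perf_counter() - t0

# The database file is opened once and reused for every blob.
db = sqlite3.connect("blobs.db")
t0 = time.perf_counter()
for i in range(N):
    db.execute("SELECT data FROM blobs WHERE key = ?", (f"{i}.bin",)).fetchone()
t_sqlite = time.perf_counter() - t0

print(f"files: {t_files:.3f}s  sqlite: {t_sqlite:.3f}s")
```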
mg over 2 years ago

Did they read and write all the data a filesystem handles?

Off the top of my head, the typical filesystem stores:

- content
- creation time
- modification time
- last access time
- read/write/execute permissions
- owner
- group
- position in the dir hierarchy
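For concreteness, a sketch (mine, not anything the benchmark uses) of what mg's list might look like as a SQLite schema if the database really did stand in for a filesystem: metadata columns next to the content, plus a parent reference for the directory hierarchy.

```python
import sqlite3

db = sqlite3.connect("fs.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS nodes (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES nodes(id),  -- position in the dir hierarchy
    name      TEXT NOT NULL,
    content   BLOB,                          -- NULL for directories
    ctime     REAL,                          -- creation time
    mtime     REAL,                          -- modification time
    atime     REAL,                          -- last access time
    mode      INTEGER,                       -- read/write/execute permission bits
    owner     TEXT,
    grp       TEXT,                          -- 'grp' avoids the SQL keyword GROUP
    UNIQUE (parent_id, name)
);
""")
```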
scotty79 over 2 years ago
Maybe node_modules could move to SQLite.
BartjeD over 2 years ago

In 2017 you didn't have io_uring yet. Though that doesn't explain it for Windows and Android.
paulgb over 2 years ago
This could use (2017) (original benchmark date) or (2021) (last modified time) in the title.
cdbattags over 2 years ago

Anyone else think maybe AWS (S3) has made this optimization already? Or would it just be a whole team of kernel engineers optimizing it there?

The overhead on CPU cycles this would save cloud storage systems... Can someone help me quantify the potential savings?

Edit:

They specifically don't list their storage medium in their marketing:

https://aws.amazon.com/s3/storage-classes/
detrites over 2 years ago
Title should have (2017)?
gnufx over 2 years ago

Long ago we had a contrasting experience. It had been assumed that the implemented "database" (not a general database) for storing nuclear spectroscopy data was needed because the filesystem was too slow. However, one of our "physicist programmers" decided to do the experiment, and found it wasn't so, and the system was re-designed around directories of files of spectra.

On the other hand, the other facility in the lab had consulted Logica on storage of similar data, who viewed x-y (or multiple dependent variables) data as tables suitable for storage in their early RDB, Rapport. That wasn't actually used in production, and the storage format for the table model was unfortunately usually mangled by data acquisition systems writing files.
k__ over 2 years ago
That claim sounds more preposterous than it actually is, considering that databases and filesystems are both just software using the same hardware in a different manner.
mpweiher over 2 years ago

More accurate title: "Operations within a file are faster than directory operations"
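A quick sketch (mine) of the distinction being drawn: N writes into one already-open file versus N directory operations (creating and closing a file per record). The absolute numbers will vary wildly by OS and filesystem.

```python
import os
import time

N = 5_000
payload = b"x" * 1024

# One directory operation total; all work happens within a single file.
t0 = time.perf_counter()
with open("one_big_file.bin", "wb") as f:
    for _ in range(N):
        f.write(payload)
t_single = time.perf_counter() - t0

# One directory operation (file creation) per record.
os.makedirs("many_files", exist_ok=True)
t0 = time.perf_counter()
for i in range(N):
    with open(f"many_files/{i}.bin", "wb") as f:
        f.write(payload)
t_many = time.perf_counter() - t0

print(f"within one file: {t_single:.3f}s  one file per record: {t_many:.3f}s")
```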
shrubble over 2 years ago

25 years ago this was done/tried with Pgfs, an NFS server with Postgres underneath it... https://www.linuxjournal.com/article/1383
eddsh1994 over 2 years ago
Would it make sense to store images as blobs if I’m building a web app using SQLite then? Or is this specifically small images only? Saves having to do backups of data and images separately :)
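A minimal sketch (not from the thread) of the blob approach being asked about: the images live in the same SQLite file as the rest of the app's data, so one backup covers both. The SQLite benchmark itself is about smallish blobs such as thumbnails; for very large files, storing them externally tends to win.

```python
import sqlite3

db = sqlite3.connect("app.db")
db.execute("CREATE TABLE IF NOT EXISTS images (name TEXT PRIMARY KEY, data BLOB)")


def save_image(name, raw_bytes):
    """Store the raw image bytes under a logical name."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO images (name, data) VALUES (?, ?)",
            (name, sqlite3.Binary(raw_bytes)),
        )


def load_image(name):
    """Return the image bytes, or None if missing."""
    row = db.execute("SELECT data FROM images WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None
```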
enriquto over 2 years ago
Is it possible to mount a database so that its rows (or columns, or whatever you have selected upon the call to mount) are accessible as regular files?
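Not an existing tool from the thread, just a read-only sketch of one way to do what's being asked, assuming the third-party fusepy package and a hypothetical table `docs(name TEXT PRIMARY KEY, body BLOB)`; each row appears as a file named after its key under the mountpoint.

```python
import errno
import sqlite3
import stat
import sys

from fuse import FUSE, FuseOSError, Operations  # pip install fusepy


class RowFS(Operations):
    """Expose rows of docs(name TEXT PRIMARY KEY, body BLOB) as read-only files."""

    def __init__(self, db_path):
        # FUSE may call us from multiple threads; reads only, so this is safe here.
        self.db = sqlite3.connect(db_path, check_same_thread=False)

    def _body(self, path):
        row = self.db.execute(
            "SELECT body FROM docs WHERE name = ?", (path.lstrip("/"),)
        ).fetchone()
        if row is None:
            raise FuseOSError(errno.ENOENT)
        return row[0]

    def getattr(self, path, fh=None):
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        body = self._body(path)
        return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1, st_size=len(body))

    def readdir(self, path, fh):
        names = [r[0] for r in self.db.execute("SELECT name FROM docs")]
        return [".", ".."] + names

    def read(self, path, size, offset, fh):
        return self._body(path)[offset:offset + size]


if __name__ == "__main__":
    # usage: python rowfs.py docs.db /mnt/docs
    FUSE(RowFS(sys.argv[1]), sys.argv[2], foreground=True, ro=True)
```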
fexecve over 2 years ago
How does it do this? Does it cache changes and only write them every so often? Does it keep a file open to write uncommitted changes?
crabbone over 2 years ago

I'll start with an anecdote. Scroll down if you want just the conclusion.

I work in test automation, in the general storage area. I worked for some years with a distributed filesystem (think something like Lustre, but modern and fast), then worked with something like DRBD, but, again, more modern. Not sure how fast (never ran any benchmarks on DRBD). I had to deal with filesystems like Ceph's filesystem, BeeGFS...

Anyways. When I worked on the DRBD analog, let's call the product "R", one of my tasks was to figure out how well a database would work on top of R. Well, "database" is a very broad term. I figured I'd concentrate on using a couple of well-known relational databases. PostgreSQL turned out to have the most to offer in terms of insight into its performance. Next, I'd have to find a suitable configuration for the database. And that's where things got really complicated. To spare you the gory details: the more memory you can spare for the benefit of your database server, the better; the more you can relax the requirement of synchronizing with persistent storage (i.e. fsync and friends), even better.

At the end of the day, I had to abandon this kind of testing because, essentially, the more memory I gave it, or the more replicas I used (which allows not caring about destaging to persistent storage), the bigger the numbers I could produce, which made the question "how well does R compare to a plain block device?" irrelevant.

---

Fast-forward to this article. It gives off a vibe of "filesystems are not efficient; if you re-arrange something in the logic of doing I/O you can gain more performance!" And that really reads like a "10 things doctors don't want you to know!" advertorial. It's similar to the misguided idea I often encounter with people who aren't system programmers that "mmap is faster".

Now, understanding why "mmap is faster" is nonsense will help understand why benchmarking a database on top of a filesystem and comparing performance doesn't make a lot of sense. To properly compare the speed, we need to make sure we compare both good and bad paths. What happens when I/O errors occur when memory-mapping files versus when using other system APIs? I invite you to explore this question on your own. Another question you need to ask yourself in this comparison is "how well does this process scale": on a single CPU and a single block device, multiple CPUs and a single block device, a single CPU and multiple block devices... And on top of this, what if we consider a Harvard architecture (to an extent, IBM's mainframes are that; at least their general I/O is separate from the rest of the computing)? In other words, what if our hardware knows "some tricks"? (Other examples include the kinds of drivers and protocols used to talk to the hardware and whether the storage software, even if running on top of a filesystem, will know about / be able to take advantage of these; i.e. NVMe allows a big degree of parallelism, but will "mmap" be able to utilize that? Especially if multiple files are mapped at the same time?)

And, of course, there is a difference between what has actually been done and what guarantees can be given about the state of the data during and after the I/O completes. These details will depend greatly on the specific filesystem being used. For example, if you wanted to write to the device directly (as in with "O_DIRECT | O_SYNC"), but the filesystem is something like ZFS or Btrfs (i.e. it needs to checksum your data, among other things), then you might get confused about what you are actually comparing (direct I/O would imply less I/O than is actually necessary to give durability guarantees that may not be given by an alternative storage).

---

So, a better title for the article would have been "It's possible to carve out a use-case where SQLite works faster than a similar process designed to only use system APIs to access a filesystem". Which is essentially saying: you, the programmer, don't know how to use a filesystem as well as we do... And, in my experience, most programmers are clueless when it comes to using a filesystem or any other kind of storage, really. So, that's not surprising, but it's still not as sensationalist as the original title.
dan-robertson over 2 years ago
I wonder how using SQLite like this compares to something like squashFS.