SQLite briefing for Linux kernel hackers (2019)

301 pointsby symisc_develover 4 years ago

13 comments

webmobdevover 4 years ago

The simple, detailed and extensive documentation is one of the reason why I love SQLite and probably a major reason for its success as an open source project. There are so many open source projects that don't even provide developers with the architecture of the project or even a simple compile guide with the list of required dependencies. (In general, I've observed that BSD-licensed open source projects tend to be better documented. Wonder why?)

评论 #25168677 未加载

harry8over 4 years ago

So the os is providing an abstraction so you don't need to know the underlying hardware, filesystems etc. Sqlite is all like, "yeah but we need to know what you're abstracting or everything is terrible."This seems like a common problem in general.Abstraction to a common interface is fine so if you don't care what the abstracted thing is you don't need to. Great. Read a basic file off spinning disk, ssd, cdrom, nfs - all the same, all good.But if you do care you can't know what it really is because they've abstracted the underlying away with whatever layer(s) sit on top.Should it be a question we need to consider when writing libraries, or any other abstraction, how does one penetrate your abstraction if they need to? Are you really, completely 100% sure they don't and shouldn't?There's a kind of arrogance that we fall into making decisions for others when we design APIs, library interfaces, etc where we know what we're exposing and users don't know what's good for them. Maybe that's right sometimes, i don't know?This sqlite paper casts some pretty strong doubt on that as an approach, imho.What do you think? Have we all been missing something like this in api design?

评论 #25170096 未加载

评论 #25170693 未加载

评论 #25169911 未加载

评论 #25169757 未加载

评论 #25170439 未加载

评论 #25173536 未加载

评论 #25170058 未加载

评论 #25170046 未加载

rkagererover 4 years ago

Today I learned:A typical Android phone has hundreds of SQLite databases and does more than 5 gigabytes of database I/O per day....F2FS is a log-structured filesystem for Linux that has (limited) atomic write capabilities. SQLite is able to use the atomic write capabilities of F2FS to bypass the rollback journal. Anecdotal reports are that an Android phone that is reformatted to use F2FS is noticeably faster.F2FS make an old tired phone feel like a perky new phone.

评论 #25168231 未加载

malwrarover 4 years ago

I’m really impressed with how... humble this briefing is. It has no prelude, no author introduction, no ego whatsoever—-just short descriptions of SQLite behavior and a proposal to make it faster. I’m feel like this format would be amazing if combined with hyperlinks to places I could find detailed info or code.

评论 #25169177 未加载

macdiceover 4 years ago

> Are sector writes atomic? And if so, for what size sectors?As a PostgreSQL hacker I have the same question, and I know the MySQL people do too, and I wrote some more specific questions and backgound, over here: <a href="https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO" rel="nofollow">https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO</a>

评论 #25169187 未加载

trevynover 4 years ago

This document is from 2019 and contains a number of questions posed to the kernel hackers/filesystem developers, for example:>Pages are sorted so that they are written in increasing order. → Is this helpful on Linux? Or can we just as well skip the sort and write the pages in any arbitrary order?Does anyone know if there are responses to these questions available anywhere?Edit: The recorded talk is available here: <a href="https://www.youtube.com/watch?v=-oP2BOsMpdo&t=5541" rel="nofollow">https://www.youtube.com/watch?v=-oP2BOsMpdo&t=5541</a> but it's a lightning talk and the discussion was very brief.

评论 #25168408 未加载

评论 #25168986 未加载

jorangreefover 4 years ago

"Are sector writes atomic? And if so, for what size sectors?"For a database, this is not the question you want to be asking. It's a rabbit hole. The right answer (which is not always the answer we want, so we keep asking): no one knows.It's better design to err on the side of assuming disk sector writes are never atomic. This would be taking the security stance, e.g. assume an attacker is actively trying to break your protocol, and use cryptographic guarantees to prevent or at least detect them.Kernel developers also can't answer these kinds of hardware questions. There's too much hardware out there.Disks may give atomicity guarantees and then you discover edge cases where these degrade, or disks may emulate Advanced Format 4096 byte logical sectors but have smaller 512 byte physical sectors underneath. Disks are too complicated (and firmware too buggy) to trust any kind of guarantee. There are too many layers of abstraction.Instead of trying to figure out the safety of the layer beneath you, assume the worst and bring the design back to using end-to-end guarantees.If you assume and plan for the worst, then you don't need to ask the question, and you can handle the worst of hardware, without surprises when those guarantees are broken.In fact, with file systems in general, the situation is even worse than for disks, because of disk corruption (3% per 36 month period per disk) and misdirected writes, which most file systems unfortunately don't handle. You should never trust any metadata provided by the file system about your journal file (e.g. file size) and use that as part of your recovery protocol, because the file system itself is storing this kind of metadata on the same disk you're writing too. Again, something as important as the metadata size of your journal file should be protected by your own end-to-end protocol.

评论 #25171779 未加载

jlokierover 4 years ago

This is a lovely document, typical of SQLite authors and why it's such a great piece of software.If they are going to ask kernel filesystem authors to provide a better API, I have long felt there is one obvious API type which is missing from Linux and would be better for filesystem databases than other approaches:<pre><code> fdatasync_rangev() </code></pre> That is, like fdatasync() (including committing size & filesystem structure blocks needed to retrieve committed data later), but only committing the data within a set of byte ranges, returning when that's done. So that it doesn't have to wait for all the rest of the data to be written.That would allow the WAL or journal to be efficiently part of the database file itself. No need for extra files, file creation and deletion, directory operations or syncing directories. That's a bunch of durable metadata writes that can be removed, and filesystem implementation details that can be avoided.It would also allow atomic and durable writes without the need for writing twice in some cases, depending how the data is structured. (First write to the WAL or rollback journal, then to the database main file). That's a bunch of data writes avoided.In cases where it can avoid two writes, that would also remove the intermediate barrier sync, doubling the commit-to-sync throughput.Ideally like other I/O (or at least reads) that should be available in a sensible async manner too, with completion when the covered data is committed durably, and only that data. I'm not sure what system API would provide good async I/O for something like SQLite though, if it's not able to use AIO or ui_uring due to kernel API being poorly suited to a self-contained library.Finally, a couple of variant (probably via flags)."No rush" flag, to let the caller wait until those ranges are committed durably, but not force them to be written faster than normal. That would allow ordinary fast buffered-write speed, while at the same time providing the usual ACID guarantees that a DB returns success from COMMIT when the data is durably committed. For some workloads that would be fine."Barrier" flag, make the call return immediately but delay all subsequent writes on the same filedescriptor until the ranges are synced durably. This is similar to what's mentioned in the article, but more versatile by attaching to ranges. It's also not strictly necessary if you have the "no rush" flag.

评论 #25173246 未加载

foftover 4 years ago

I’d really love SQLite to work on the nfs and cifs. Is there anything that can be done to make that safe at a kernel level?

评论 #25168313 未加载

评论 #25168502 未加载

评论 #25168360 未加载

jlokierover 4 years ago

I'm very surprised that SQLite trusts fdatasync() when appending to a file, rather than fsync(), given SQLite is written to be so portable and cautious and has been around a long time.I've always been suspicious that fdatasync() does not promise to durably commit the size metadata (and perhaps indirection blocks) on some OSes/versions, just the data blocks itself. Such that it's only safe to use fdatasync() on already allocated regions of the file. That was my understanding of fdatasync() when I first read about it, and I vaguely recall some DB handled this by rarely appending to files, instead doing it occasionally by pausing writes (to that file), appending a large block of zeros to the file then fsync() to commit, then resuming writes.It's not just me thinking this. From a quick Google:> <a href="https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/" rel="nofollow">https://www.percona.com/blog/2018/02/08/fsync-performance-st...</a>> "The fsync system is not the only system call that persists data to disk. There is also the fdatasync call. fdatasync persists the data to disk but does not update the metadata information like the file size and last update time. Said otherwise, it performs one write operation instead of two."

评论 #25174237 未加载

评论 #25171756 未加载

yencabulatorover 4 years ago

> SQLite databases are commonly used as storage containers for sending structured content across the internet.Can SQLite databases still attack the host that opens them?<a href="https://research.checkpoint.com/select-code_execution-from-using-sqlite/" rel="nofollow">https://research.checkpoint.com/select-code_execution-from-u...</a>

elsjaakoover 4 years ago

Is there any database that just defines their own partition type? It always seemed strange to me that databases are so optimized, but still operate on top of the filesystem abstraction.

评论 #25171228 未加载

评论 #25169953 未加载

arendtioover 4 years ago

Does someone know if there is something like an SQLite server, so that it could be used like a 'normal' database server?Why? Well, for benchmarking for example, or if the application you are using has a bad multi-user implementation and you don't want to migrate to another database...

评论 #25170150 未加载

评论 #25168759 未加载

评论 #25170887 未加载

评论 #25169541 未加载

评论 #25168682 未加载

评论 #25169302 未加载

评论 #25170446 未加载

13 comments

webmobdevover 4 years ago

评论 #25168677 未加载

harry8over 4 years ago

评论 #25170096 未加载

评论 #25170693 未加载

评论 #25169911 未加载

评论 #25169757 未加载

评论 #25170439 未加载

评论 #25173536 未加载

评论 #25170058 未加载

评论 #25170046 未加载

rkagererover 4 years ago

评论 #25168231 未加载

malwrarover 4 years ago

评论 #25169177 未加载

macdiceover 4 years ago

评论 #25169187 未加载

trevynover 4 years ago

评论 #25168408 未加载

评论 #25168986 未加载

jorangreefover 4 years ago

评论 #25171779 未加载

jlokierover 4 years ago

评论 #25173246 未加载

foftover 4 years ago

I’d really love SQLite to work on the nfs and cifs. Is there anything that can be done to make that safe at a kernel level?

评论 #25168313 未加载

评论 #25168502 未加载

评论 #25168360 未加载

jlokierover 4 years ago

评论 #25174237 未加载

评论 #25171756 未加载

yencabulatorover 4 years ago

elsjaakoover 4 years ago

Is there any database that just defines their own partition type? It always seemed strange to me that databases are so optimized, but still operate on top of the filesystem abstraction.