Investigating Linux phantom disk reads

302 points by kamaraju about 2 years ago

8 comments

addisonj about 2 years ago
I am going to write this comment with a large preface: I don't think it is ever helpful to be an absolutist. For every best-practice/"right way" to do things, there are circumstances when doing it another way makes sense. There can be a ton of reasons for that, be it technical, money/time, etc. The best engineering teams aren't those that just blindly follow what others say is a best practice but those that understand the options and make an informed choice. None of the following comment is at all commentary on QuestDB; as they mention in the article, *many* databases use similar tools.

With that said, after reading the first paragraph I immediately searched the article for "mmap" and had a good sense of where the rest of this was going. Put simply, it is just really hard to consider what the OS is going to do in all situations when using mmap. Based on my experience, I would guess that a *ton* of people reading this comment have hit issues that, I would argue, are due to using mmap. (Particularly looking at you, Prometheus.)

All things told, this is a pretty innocuous incident of mmap causing problems, but I would encourage any aspiring DB engineers to read https://db.cs.cmu.edu/mmap-cidr2022 as it gives a great overview of the range of problems that can occur when using mmap.

I think some would argue that mmap is "fine" for append-only workloads (and is certainly more reasonable compared to a DB with arbitrary updates), but even here, lots of factors like metadata, scaling the number of tables, etc. will *eventually* bring you to hit some fundamental problems when using mmap.

The interesting opportunity in my mind, especially with improvements in async IO (both at the FS level and in tools like Rust), is to build higher-level abstractions that bring the "simplicity" of mmap, but with more purpose-built semantics ideal for databases.
pengaru about 2 years ago
Going through mmap for bulk-ingest sucks because the kernel has to fault in the contents to make what's in-core reflect what's on-disk before your write access to the mapped memory occurs. It's basically a read-modify-write pattern even when all you intended to do was write the entire page.

When you just use a write call you provide a unit of arbitrary size, and if you've done your homework that size is a multiple of the page size and the offset is page-aligned. Then there's no need for the kernel to load anything in for the written pages; you're providing everything in the single call. Then you go down the O_DIRECT rabbit hole every fast Linux database has historically gone down.
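
To make the "multiple of page size, page-aligned offset" point concrete, here is a minimal Python sketch of such a write. The helper names `append_full_pages` and `demo_aligned_write` are hypothetical, and zero-padding the last page is an assumption made for illustration; a real database would track the logical length separately. It does not use O_DIRECT.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # page size reported for this system


def append_full_pages(fd: int, offset: int, payload: bytes) -> int:
    """Write payload at a page-aligned offset, padded to whole pages.

    Returns the offset just past the written region.
    """
    if offset % PAGE != 0:
        raise ValueError("offset must be page-aligned")
    # Round the length up to a whole number of pages and pad with zeros
    # (padding is an illustrative assumption, not the article's method).
    padded_len = -(-len(payload) // PAGE) * PAGE
    buf = payload.ljust(padded_len, b"\0")
    written = 0
    while written < len(buf):
        # os.pwrite may write fewer bytes than requested, so loop.
        written += os.pwrite(fd, buf[written:], offset + written)
    return offset + written


def demo_aligned_write() -> tuple[int, int]:
    """Write 10 bytes into a fresh temp file; return (end offset, file size)."""
    with tempfile.TemporaryFile() as f:
        end = append_full_pages(f.fileno(), 0, b"0123456789")
        return end, os.fstat(f.fileno()).st_size
```

Because every call covers complete pages, the next write can start at the returned offset and stay aligned.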
dmazin about 2 years ago
Am I the only one surprised to read that this database relies on periodic flushing (every 30s by default) with no manual syncs at all? I guess it’s metrics so 30s of data loss is fine? I dunno about that. Data loss is usually due to a power failure, and the metrics collected right before a power failure are important.
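
For contrast with relying only on the kernel's periodic writeback, this is roughly what a manual sync on a mapped file looks like. It is a sketch under assumptions, not QuestDB's code: `write_and_sync` and `demo_sync` are hypothetical names, and it uses Python's `mmap.flush()`, which issues an `msync` on Unix.

```python
import mmap
import os
import tempfile


def write_and_sync(path: str, data: bytes) -> None:
    """Copy data into a shared file mapping, then flush it explicitly."""
    if not data:
        raise ValueError("data must be non-empty")
    fd = os.open(path, os.O_RDWR)
    try:
        # The file must be at least as large as the region being mapped.
        os.ftruncate(fd, len(data))
        with mmap.mmap(fd, len(data)) as m:  # shared mapping by default on Unix
            m[:] = data
            # Explicit sync: do not wait for the periodic writeback.
            m.flush()
    finally:
        os.close(fd)


def demo_sync() -> bytes:
    """Write one metrics-like record through the mapping and read it back."""
    with tempfile.NamedTemporaryFile() as tmp:
        write_and_sync(tmp.name, b"cpu=0.42\n")
        with open(tmp.name, "rb") as f:
            return f.read()
```

The trade-off is latency: each explicit flush blocks the writer, which is presumably why a metrics store might prefer the periodic default.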
davidhyde about 2 years ago
Seems like using memory-mapped files for a write-only load is a suboptimal choice. Maybe I'm mistaken, but surely using an append-only file handle would be simpler than changing how memory-mapped files are cached, as they did for their solution?
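
A minimal sketch of the append-only file handle alternative, assuming a plain POSIX descriptor opened with `O_APPEND`. The names `append_record` and `demo_append` are hypothetical and nothing here reflects how QuestDB actually writes column files.

```python
import os
import tempfile


def append_record(fd: int, record: bytes) -> None:
    """Append record through a descriptor opened with O_APPEND."""
    view = memoryview(record)
    while view:
        # os.write may be partial; drop the bytes that were accepted.
        view = view[os.write(fd, view):]


def demo_append() -> bytes:
    """Append two records to a temp file and return the file contents."""
    with tempfile.NamedTemporaryFile() as tmp:
        # With O_APPEND the kernel positions every write at end of file.
        fd = os.open(tmp.name, os.O_WRONLY | os.O_APPEND)
        try:
            append_record(fd, b"row-1\n")
            append_record(fd, b"row-2\n")
        finally:
            os.close(fd)
        with open(tmp.name, "rb") as f:
            return f.read()
```

No mapping is involved, so there is no page to fault in before the data is handed to the kernel.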
rajnathani about 2 years ago
I know sharing ChatGPT/GPT/AI generated text in comments here can be unappealing, but I would like to share this one as I feel that I managed to get ChatGPT-4 to summarize this article using a non-computer analogy pretty well:

"Imagine you work in a library where you store books on shelves. Your primary task is to take new books and put them on the shelves (write-only load). You don't expect to read the books often, so the number of times you need to open and read the books should be minimal.

One day, you notice that several books are being opened and read more often than expected, even though your main task is to put away new books. This is confusing and unexpected, so you start investigating why this is happening.

After some investigation, you find out that the library assistant (the operating system) is trying to be helpful by anticipating which books might be needed next and opening them ahead of time (readahead). This anticipation works well when there is plenty of shelf space (memory) available. However, when the library gets crowded (memory pressure), the assistant starts anticipating the wrong books, causing unnecessary book openings (phantom reads).

To resolve this issue, you tell the library assistant to stop anticipating which books to open (disabling readahead) when you're just putting away new books. This solves the problem and reduces the number of unnecessary book openings. The experience teaches you the importance of understanding how the library assistant works and shows that addressing unexpected issues can lead to improvements in the overall library system."
sytse about 2 years ago
TL;DR: "Ingestion of a high number of column files under memory pressure led to the kernel starting readahead disk read operations, which you wouldn't expect from a write-only load. The rest was as simple as using madvise in our code to disable the readahead in table writers."
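
The quoted fix can be sketched in Python, whose `mmap` objects expose `madvise` from version 3.8. The quote does not say which advice value was used; `MADV_RANDOM` below is an assumption (on Linux it tells the kernel not to expect sequential access, which turns off readahead for the mapping), and `map_without_readahead` and `demo_madvise` are hypothetical names.

```python
import mmap
import os
import tempfile


def map_without_readahead(fd: int, length: int) -> mmap.mmap:
    """Map a file read/write and advise the kernel against readahead."""
    m = mmap.mmap(fd, length)
    # MADV_RANDOM is only defined on platforms that support it, and it is
    # an assumed choice here; the quoted summary does not name the flag.
    advice = getattr(mmap, "MADV_RANDOM", None)
    if advice is not None:
        m.madvise(advice)
    return m


def demo_madvise() -> bytes:
    """Write through an advised mapping and read the bytes back from the file."""
    length = 4 * mmap.PAGESIZE
    with tempfile.TemporaryFile() as f:
        os.ftruncate(f.fileno(), length)
        with map_without_readahead(f.fileno(), length) as m:
            m[:5] = b"hello"
            m.flush()
        return os.pread(f.fileno(), 5, 0)
```

The advice is only a hint: writes through the mapping behave the same, and only the kernel's prefetching around page faults changes.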
0xbadcafebee about 2 years ago
There are other methods you can use to increase performance under memory pressure, but you'd end up handling I/O directly and maintaining your own index of memory and disk accesses, page-aligned reads/writes, etc. It would be easier to just require that your users buy more memory, but when there's a hack like this available, that seems preferable to implementing your own VMM and disk I/O subsystem.
speedgoose about 2 years ago
> It's also important to note that the above percentages…

Has this article been written using ChatGPT by any chance?