TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

The No-Order File System (2012)

129 pointsby harporoederover 4 years ago

11 comments

Animatsover 4 years ago
That&#x27;s progress.<p>I&#x27;ve previously suggested that operating systems should have stronger file integrity guarantees. &quot;Unit&quot; files (rewriting replaces the whole file atomically, no reader ever sees a partially written file). That&#x27;s the default. &quot;Log&quot; files (always end at a clean end point, don&#x27;t tail off into junk). &quot;Temp&quot; files (disappear on reboot). And, for databases, &quot;Managed&quot; files.<p>Managed files have more I&#x2F;O functions. In particular, you get two completion events on writes - &quot;copy complete&quot; (the caller can reuse the buffer) and &quot;safely stored&quot; (the data has reached its final resting place, all links are complete, etc.). Programs like databases would use that. Those are the semantics databases want, and struggle to get by flushing, waiting, and various workarounds.<p>When I mention this, what usually happens is that people get lost in complicated workarounds for simulating unit files. Different approaches are needed for Linux, Windows, NTFS, and various VM systems. This should Just Work.<p>This isn&#x27;t my invention; it&#x27;s from Popek&#x27;s kernel in 1985 at UCLA, later seen as UCLA-Locus and as an IBM product. They had explicit commit and revert functions for file systems. I&#x27;d suggest having the default be commit on normal close or normal program exit, but if the program aborts or crashes or is killed, unit files don&#x27;t commit and remain unchanged.
评论 #25923776 未加载
评论 #25921183 未加载
评论 #25921070 未加载
评论 #25921261 未加载
评论 #25920634 未加载
alextheparrotover 4 years ago
Remzi and Andrea [0] have a pretty well regarded OS book used to teach the OS course at Wisconsin [1-2]. Hope they’re doing well, Andrea’s course teaching Scratch to 4th and 5th graders at a local elementary school was a highlight of my time there.<p>[0] I would use honorifics, but they are married which makes it a bit confusing. [1] Which I never took but was well regarded when I was an undergrad<p>[2] <a href="http:&#x2F;&#x2F;pages.cs.wisc.edu&#x2F;~remzi&#x2F;OSTEP&#x2F;" rel="nofollow">http:&#x2F;&#x2F;pages.cs.wisc.edu&#x2F;~remzi&#x2F;OSTEP&#x2F;</a>
评论 #25919951 未加载
pwinnskiover 4 years ago
It&#x27;s a fascinating concept, but nothing seems to have happened in the eight years since.<p>This seems like it might gain a bit of performance in exchange for slightly-higher filesystem overhead, though it&#x27;s not clear there are any stats on either. It&#x27;s also not clear how exactly performance increases would surface: are reads slower but writes faster?
评论 #25919661 未加载
tytsoover 4 years ago
I can think of a number of reasons why this may not have gotten a lot of traction in the intervening eight years. One is that DIF&#x2F;DIX disks are not easily available at reasonable prices.<p>The other potential shortcoming is that it provides substantially weaker consistency guarantees than what people are used to. After all of the scan threads are finished with the recovery, yes, the file system metadata will be self-consistent; but that&#x27;s all you can count upon.<p>Suppose that a particular file is getting updated at the time of the crash. There might be several blocks that were newly allocated right before the crash, and the a larger number of data blocks that were getting overwritten right before the crash. There is no guarantee which set of data blocks will be persistent across the reboot, and which newly allocated blocks will actually be attached to the file. There might also be newly allocated blocks that were attached to the file, but the data might not be written to the block, such that stale data (the previous contents of the block, which mgiht be another user&#x27;s medical data, or private e-mail, etc.) that would become visible to the file across the crash.
peter_d_shermanover 4 years ago
I got an idea when reading this for a file system for rotational (aka, &quot;spinning rust&quot;) old-skool hard drives (which are probably pretty dated by this point!).<p>Basically, you create a B-tree.<p>Everything except the leaf nodes (that is, the root and all of the internal nodes) are stored outwardly from the center track of the hard drive, so if that&#x27;s track 40, then it would be 40 then 39, then 41, then 38, then 42, etc.<p>Now, each leaf node is stored on the start of each track, and represents all of the blocks (after the blocks it itself takes up) on that track.<p>Now, when the OS needs to write or delete a block from a file, then the OS repositions the head to that track, writes the block, and then updates the leaf node (it&#x27;s already on the same track -- so it&#x27;s faster than moving the hard drive head to another track to do the update).<p>Yes, the rest of the B-Tree (on other tracks) might need to be updated as well, but if you were writing a whole bunch of blocks to an empty track at the same time, I think it could save some time...<p>If we were writing to a pre-existing file, and the B-Tree already had that track&#x27;s leaf-node included in its pointers (that is, let&#x27;s say that we&#x27;re already writing to a file that already had one block on that track), then all we&#x27;d have to do is:<p>1) Move the hard drive head to that track<p>2) Write the new block<p>3) Update the leaf node of the B-Tree to say that the block is now part of the file.<p>It gets even better.<p>If there&#x27;s a single block of the file on that track, and the track is otherwise empty, then if we have enough data to fill the track, then write all of the blocks, and cache writing to the B-Tree until all of that&#x27;s done... in other words, a single write!<p>B-Tree entries are a key comprising FileID + DiskBlockID.<p>Files are ordered in DiskBlockID order.<p>Should be extremely fast -- for old spinning hard drives...<p>(Yes, I know... nobody uses spinning hard drives anymore! &lt;g&gt;)
评论 #25925660 未加载
gray_-_wolfover 4 years ago
&gt; Sorry! The URL you requested was not found on our server.<p>The link seems to be dead :&#x2F;
评论 #25920841 未加载
ro_bitover 4 years ago
Title could use a (2012)
jzer0coolover 4 years ago
&gt; Modern file systems use ordering points to maintain consistency in the face of system crashes.<p>Could someone shed more light? In modern or older O&#x2F;S, what exactly happens during the crash?
评论 #25920927 未加载
tentacleunoover 4 years ago
Didn&#x27;t Plan9 do something like this?
评论 #25920485 未加载
not2bover 4 years ago
The publications they point to are from 2012 and 2013. Did this work ever go anywhere after that?
Ericson2314over 4 years ago
All I ask from storage: please give me a b-epsilon tree for content-addressed data. I will handle naming, mutation, etc. myself.<p>The number one problem is all the current abstractions are stupefied, with everyone conway&#x27;s law-ing around everyone else.