What a great and valuable post, especially since this info is the result of talking to the APFS team at WWDC, and has not been published anywhere else yet.<p>Of particular interest (to me) was the "Checksums" section:<p><pre><code> Notably absent from the APFS intro talk was any mention of
checksums....APFS checksums its own metadata but not user data.
...The APFS engineers I talked to cited strong ECC protection
within Apple storage devices. Both flash SSDs and magnetic media
HDDs use redundant data to detect and correct errors. The
engineers contend that Apple devices basically don’t return
bogus data.
</code></pre>
That is utterly disappointing. SSDs have internal checksums, sure, but there are so many different ways and different points at which a bit can be flipped.<p>It's hard for me to imagine a worse starting point to conceive a new filesystem than "let's assume our data storage devices are perfect, and never have any faulty components or firmware bugs".<p>ZFS has a lot of features, but data integrity is <i>the</i> feature.<p>I get that maybe a checksumming filesystem could conceivably be too computationally expensive for the little jewelry-computers Apple is into these days, but it's a terrible omission on something that is supposed to be the new filesystem for macOS.
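To make that concrete, here's a toy sketch (made-up block layout and helper names, nothing like actual ZFS code) of the verify-and-repair read path an end-to-end checksumming filesystem gives you, and which device-internal ECC can't replicate because it never sees corruption introduced above the media:<p><pre><code>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/* The checksum lives in the block pointer, not next to the data it covers,
   so it travels a different path through the stack than the data itself. */
struct block_ptr {
    uint64_t addr[2];     /* two redundant copies (mirror) */
    uint32_t checksum;    /* expected checksum of the block contents */
};

/* Hypothetical helper: read one copy of the block, returns 0 on I/O success. */
extern int read_copy(uint64_t addr, unsigned char *out, size_t len);

static uint32_t toy_checksum(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + buf[i];   /* stand-in for fletcher4/sha256 */
    return sum;
}

/* Try each copy until one both reads successfully and verifies. */
int checked_read(const struct block_ptr *bp, unsigned char *out)
{
    for (int i = 0; i < 2; i++) {
        if (read_copy(bp->addr[i], out, BLOCK_SIZE) != 0)
            continue;                      /* hard I/O error: try the mirror */
        if (toy_checksum(out, BLOCK_SIZE) == bp->checksum)
            return 0;                      /* verified end to end */
        /* Checksum mismatch: the device reported success but handed back
           bogus data (cable, controller, firmware, RAM, ...). */
    }
    return -1;                             /* every copy failed or was corrupt */
}
</code></pre>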
> I get that maybe a checksumming filesystem could conceivably be too computationally expensive for the little jewelry-computers Apple is into these days, but it's a terrible omission on something that is supposed to be the new filesystem for macOS.<p>Checksumming has another cost that isn't immediately obvious. Suppose you write to a file and the writes are cached. Then the filesystem starts to flush to disk. On a conventional filesystem, you can keep writing to the dirty page while the disk DMAs data out of it. On a checksumming filesystem, you can't: you have to compute the checksum and then write out data consistent with the checksum. This means you either have to delay user code that tries to write, or you have to copy the page, or you need hardware support for checksumming while writing.<p>On Linux, this type of delay is called "stable pages", and it <i>destroys</i> performance on some workloads on btrfs.
Slightly worried by the vibe that comes off this. "I asked him about looking for inspiration in other modern file systems ... he was aware of them, but didn’t delve too deeply for fear, he said, of tainting himself". And (to paraphrase): 'bit-rot? What's that?'.<p>I would have hoped that a new filesystem with such wide future adoption would have come from a roomful of smart people with lots of experience of (for example) contributing to various modern filesystems, understanding their strengths and weaknesses, and dealing with data corruption issues in the field. This doesn't come across that way at all.
I'm extremely confused by this:<p>> With APFS, if you copy a file within the same file system (or possibly the same container; more on this later), no data is actually duplicated. [...] I haven’t seen this offered in other file systems [...]<p>To my knowledge, this is exactly what cp --reflink does on GNU/Linux on a supporting filesystem, most notably btrfs, and newer combinations of the kernel and GNU coreutils can do it via cp --reflink=auto.<p>This guy seems too well-informed and experienced in the domain to miss something so obvious, though. So what am I missing?<p>Also interesting to me is the paragraph about prioritizing certain I/O requests to optimize interactive latency: on Linux this is done by the I/O scheduler, which is exchangeable and agnostic to the filesystem. Perhaps greater insight into the filesystem could aid I/O scheduling (this has also been the argument for moving RAID code into filesystems, which APFS opts against), so hearing a well-informed opinion on this point would be interesting. Unless this post gets it wrong and I/O scheduling isn't actually implemented in APFS either.<p>It <i>seems</i> like this perspective might be one written from within a Solaris/ZFS bubble and further hamstrung by macOS' closed-source development model. Which is interesting in light of the Giampaolo quote about intentionally not looking closely at the competition.
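For anyone who hasn't run into it, this is roughly what cp --reflink asks the kernel to do on a reflink-capable filesystem (a minimal sketch; FICLONE is the generic ioctl that recent kernels promoted from btrfs's clone ioctl):<p><pre><code>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>    /* FICLONE */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
        return 1;
    }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }
    /* Ask the filesystem to let DST share SRC's extents copy-on-write;
       no data blocks are duplicated until one side is modified.
       Fails with EOPNOTSUPP on filesystems without reflink support. */
    if (ioctl(dst, FICLONE, src) != 0) {
        perror("ioctl(FICLONE)");
        return 1;
    }
    close(src);
    close(dst);
    return 0;
}
</code></pre>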
In my opinion, APFS does not seem to improve upon ZFS in several key areas (compression, sending/receiving snapshots, dedup, etc.). Apple is reimplementing many features already implemented in OpenZFS, btrfs (which itself reimplemented a lot of ZFS features), BSD HAMMER, etc.<p>Maybe extending one of these existing filesystems to add any functionality Apple needs on top of its existing features (and, hopefully, contributing that back to the open source implementation) would cost more person-hours than implementing APFS from scratch. Maybe not.<p>Either way, we will now have yet another filesystem to contend with and (maybe) implement in non-Darwin kernels, which adds to the overall support overhead of every operating system that wants to be compatible with Apple devices. Since older versions of macOS (OS X) don't support APFS, only HFS+, Apple and others will also have to keep supporting HFS+. It just seems wasteful of everyone's time to me.<p>Also: <a href="https://xkcd.com/927/" rel="nofollow">https://xkcd.com/927/</a>
<i>For example, my 1TB SSD includes 1TB (2^30 = 1024^3) bytes of flash but only reports 931GB of available space, sneakily matching the storage industry’s self-serving definition of 1TB (1000^3 = 1 trillion bytes).</i><p>Great article, but a couple of nitpicking corrections (which seem appropriate for a storage article):
Per <a href="https://en.wikipedia.org/wiki/Terabyte" rel="nofollow">https://en.wikipedia.org/wiki/Terabyte</a>, a terabyte is 1000^4 bytes, not 1000^3.<p>Also, it's been 6+ years since we all agreed that TiB means 2^40 (1024^4) and TB means 10^12. Indeed, <i>only</i> in the case of memory does "T" ever mean 2^40 anyway; for both data rates and storage, T has always meant 10^12. That convention is strong enough that most of us have just thrown up our hands and agreed that, when referring to DRAM, terabyte means 1024^4, and 1000^4 everywhere else.<p>Indeed, in the rare case where someone uses TiB to refer to a data rate, they are almost without exception using it incorrectly and actually mean TB.
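To put numbers on the 931 figure from the quoted paragraph (assuming the drive really holds 10^12 bytes):<p><pre><code>
1 TB  = 1000^4 bytes = 1,000,000,000,000 bytes
1 TiB = 1024^4 bytes = 1,099,511,627,776 bytes

1,000,000,000,000 / 1024^3 ≈ 931.3
</code></pre>
So the "931GB" the article mentions for a 10^12-byte drive is really 931 GiB; neither number is wrong, they're just different units.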
> Also, APFS removes the most common way of a user achieving local data redundancy: copying files. A copied file in APFS actually creates a lightweight clone with no duplicated data.<p>No, it doesn't. APFS supports copying files, if you want that. It's just that the default in Finder is to make a “clone” (copy-on-write).
I'm still looking for a widely supported (at least FreeBSD and Linux kernels) filesystem for external drives to carry around that doesn't have the FAT32 limitations. There's exFAT, but no stable and supported implementation. Then there's NTFS, but that's also not 100% reliable in my experience when used through FUSE (NTFS-3G). I've considered UFS, but that was also a no-go. I'm hopeful for lklfuse[1], which also runs on FreeBSD and gives access to ext4, XFS, etc. in a Rump-like way, letting you use the same drivers on FreeBSD. I'm cautious though, given that I don't want corrupted data I might notice too late. Let's see if lklfuse provides LUKS as well; otherwise DragonFly's LUKS implementation might need to be ported to FreeBSD or something like that. External drives one might lose need to be encrypted.<p>[1] <a href="https://www.freshports.org/sysutils/fusefs-lkl/" rel="nofollow">https://www.freshports.org/sysutils/fusefs-lkl/</a>
The file-level deduplication [1] is interesting. I'm not a filesystem expert, but it sounds like it fulfills a similar use case to snapshots [2]. Or am I reading this wrong?<p>Is NTFS's shadow copy like snapshots?<p>[1] <a href="http://dtrace.org/blogs/ahl/2016/06/19/apfs-part3/#apfs-clones" rel="nofollow">http://dtrace.org/blogs/ahl/2016/06/19/apfs-part3/#apfs-clon...</a><p>[2] <a href="http://dtrace.org/blogs/ahl/2016/06/19/apfs-part2/#apfs-snapshots" rel="nofollow">http://dtrace.org/blogs/ahl/2016/06/19/apfs-part2/#apfs-snap...</a>
I think the value of this new proprietary filesystem is limited, since you can't run it on servers (Apple does not make servers anymore). Also, compatibility/porting issues may become a problem if you build your software for it.