I originally wrote this article explaining the data model as implemented in the original Unix "spell": <a href="https://blog.codingconfessions.com/p/how-unix-spell-ran-in-64kb-ram" rel="nofollow">https://blog.codingconfessions.com/p/how-unix-spell-ran-in-6...</a>.<p>This reached Doug, who then responded to it with the follow up work he did which wasn't published anywhere.
The text of the email (courtesy of Firefox/macOS Text recognition):<p>___________________<p>Subject: yes, it's me, the author of "spell"<p>You did a nice job of gently describing "spell"s data representation. But the formula x = 2^{log_2(m)}-m is a complicated way to say x=0. Did you mean that?<p>When memory got bigger, I kept the literal dictionary in memory, but compressed it on secondary memory for fast transfer. The compression was trivial: store a suffix preceded by one byte that contained the length of the prefix that the word shared with its predecessor in dictionary order.<p>We also added one byte per enry for affixing info, e.g. whether in- (or im- or ir-) is preferred over un-, whether a final consonant is doubled before adding a suffix that begins with a vowel, part of speech, etc. Although there were lots of such attributes, the number of distinct combinations of attributes that actually occurred was fewer than 256, so could be coded into one byte, and decoded by lookup in a 256-entry table. I automated the construction of that table, which would change if a dictionary revision created a new combination of attributes.<p>With affixing info we could at the same time be more aggressive about affixing, and make fewer mistakes like allowing -s on a word that cannot be used as a noun or a verb.
There used to be a page of Doug McIlroy "facts". I managed to hunt this page down by searching around.<p><a href="http://git.9front.org/plan9front/plan9front/c28f07b5840216c4e83e1d3b21d8a9ed7c41f8f9/lib/dougfacts/f.html" rel="nofollow">http://git.9front.org/plan9front/plan9front/c28f07b5840216c4...</a><p>One of the "facts" is:
> Doug McIlroy can address 8 terabytes of RAM with only 32 bits.
> The compression was trivial: store a suffix preceded by one byte that contained the length of the prefix that the word shared with its predecessor in dictionary order<p>This is more or less what was speculated upon as compatible with his stemming ideas [1], but I (and partly McIlroy1982) doubt you ever really needed the fancy hashing-compression in the first place (though I'm sure it was fun to figure out!). I got better perf (3x faster on a 5500 word document with an 82,000 word "unabridged" dictionary that prefix-compresses to only 278K which would fit on a 360K floppy of the era) than v7spell on v7 with just a merge algorithm somewhat more finely tuned against that format. Of course, on a shared computer getting either the streaming disk access (without seeks spoiling the party) or all that RAM to yourself (without swapping spoiling the party) is a guess. And, yeah, probably the background Doug mentions is all in the combined-unix-git-history by now { and it would not have been as attractive an academic paper :-) }.<p>[1] <a href="https://news.ycombinator.com/item?id=42931145">https://news.ycombinator.com/item?id=42931145</a>
Recent and related:<p><i>How Unix spell ran in 64kb RAM</i> - <a href="https://news.ycombinator.com/item?id=42752604">https://news.ycombinator.com/item?id=42752604</a> - Jan 2025 (51 comments)
Slightly related:<p>In the paper Leidner and Plachouras (2017), we reported an observation initially
due to Bentley, namely that the McIllroy spell(1) implementation emailed a list of unknown words when it was run over a document to the author. While technically, this is a neat way to increase dictionary size by "mining user data", and certain versions of Microsoft Word and Microsoft Edge (see
<a href="https://news.ycombinator.com/item?id=35208333">https://news.ycombinator.com/item?id=35208333</a> ) had the same behavior, it is privacy-violating at least if users are not informed beforehand [1]. Of course this has to be seen in the cultural context of the UNIX community at the time (people were even weary of using passwords then to protect their accounts), but still harm could be done if an email tasked about "malinoma" when that term was still absent from the dictionary, possibly revealing a sensitive condition of the email author or their circle of friends and family to a third party, the software engineer of the speel checker.<p>[1] <a href="https://aclanthology.org/W17-1604.pdf" rel="nofollow">https://aclanthology.org/W17-1604.pdf</a>
Wow, Doug McIlroy is 92 and still sharp, working through the implementation. Too many times, people complain of being over the hill when they hit 30!. Mental capability doesn't slow that dramatically (as physical) good perspective to have.<p><a href="https://en.wikipedia.org/wiki/Douglas_McIlroy" rel="nofollow">https://en.wikipedia.org/wiki/Douglas_McIlroy</a>