TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Douglas McIlroy responds to Unix spell article with new implementation details

163 点作者 abhi9u4 个月前
I originally wrote this article explaining the data model as implemented in the original Unix &quot;spell&quot;: <a href="https:&#x2F;&#x2F;blog.codingconfessions.com&#x2F;p&#x2F;how-unix-spell-ran-in-64kb-ram" rel="nofollow">https:&#x2F;&#x2F;blog.codingconfessions.com&#x2F;p&#x2F;how-unix-spell-ran-in-6...</a>.<p>This reached Doug, who then responded to it with the follow up work he did which wasn&#x27;t published anywhere.

7 条评论

re4 个月前
The text of the email (courtesy of Firefox&#x2F;macOS Text recognition):<p>___________________<p>Subject: yes, it&#x27;s me, the author of &quot;spell&quot;<p>You did a nice job of gently describing &quot;spell&quot;s data representation. But the formula x = 2^{log_2(m)}-m is a complicated way to say x=0. Did you mean that?<p>When memory got bigger, I kept the literal dictionary in memory, but compressed it on secondary memory for fast transfer. The compression was trivial: store a suffix preceded by one byte that contained the length of the prefix that the word shared with its predecessor in dictionary order.<p>We also added one byte per enry for affixing info, e.g. whether in- (or im- or ir-) is preferred over un-, whether a final consonant is doubled before adding a suffix that begins with a vowel, part of speech, etc. Although there were lots of such attributes, the number of distinct combinations of attributes that actually occurred was fewer than 256, so could be coded into one byte, and decoded by lookup in a 256-entry table. I automated the construction of that table, which would change if a dictionary revision created a new combination of attributes.<p>With affixing info we could at the same time be more aggressive about affixing, and make fewer mistakes like allowing -s on a word that cannot be used as a noun or a verb.
评论 #42995298 未加载
devin4 个月前
There used to be a page of Doug McIlroy &quot;facts&quot;. I managed to hunt this page down by searching around.<p><a href="http:&#x2F;&#x2F;git.9front.org&#x2F;plan9front&#x2F;plan9front&#x2F;c28f07b5840216c4e83e1d3b21d8a9ed7c41f8f9&#x2F;lib&#x2F;dougfacts&#x2F;f.html" rel="nofollow">http:&#x2F;&#x2F;git.9front.org&#x2F;plan9front&#x2F;plan9front&#x2F;c28f07b5840216c4...</a><p>One of the &quot;facts&quot; is: &gt; Doug McIlroy can address 8 terabytes of RAM with only 32 bits.
评论 #42998758 未加载
cb3214 个月前
&gt; The compression was trivial: store a suffix preceded by one byte that contained the length of the prefix that the word shared with its predecessor in dictionary order<p>This is more or less what was speculated upon as compatible with his stemming ideas [1], but I (and partly McIlroy1982) doubt you ever really needed the fancy hashing-compression in the first place (though I&#x27;m sure it was fun to figure out!). I got better perf (3x faster on a 5500 word document with an 82,000 word &quot;unabridged&quot; dictionary that prefix-compresses to only 278K which would fit on a 360K floppy of the era) than v7spell on v7 with just a merge algorithm somewhat more finely tuned against that format. Of course, on a shared computer getting either the streaming disk access (without seeks spoiling the party) or all that RAM to yourself (without swapping spoiling the party) is a guess. And, yeah, probably the background Doug mentions is all in the combined-unix-git-history by now { and it would not have been as attractive an academic paper :-) }.<p>[1] <a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=42931145">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=42931145</a>
dang4 个月前
Recent and related:<p><i>How Unix spell ran in 64kb RAM</i> - <a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=42752604">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=42752604</a> - Jan 2025 (51 comments)
jll294 个月前
Slightly related:<p>In the paper Leidner and Plachouras (2017), we reported an observation initially due to Bentley, namely that the McIllroy spell(1) implementation emailed a list of unknown words when it was run over a document to the author. While technically, this is a neat way to increase dictionary size by &quot;mining user data&quot;, and certain versions of Microsoft Word and Microsoft Edge (see <a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=35208333">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=35208333</a> ) had the same behavior, it is privacy-violating at least if users are not informed beforehand [1]. Of course this has to be seen in the cultural context of the UNIX community at the time (people were even weary of using passwords then to protect their accounts), but still harm could be done if an email tasked about &quot;malinoma&quot; when that term was still absent from the dictionary, possibly revealing a sensitive condition of the email author or their circle of friends and family to a third party, the software engineer of the speel checker.<p>[1] <a href="https:&#x2F;&#x2F;aclanthology.org&#x2F;W17-1604.pdf" rel="nofollow">https:&#x2F;&#x2F;aclanthology.org&#x2F;W17-1604.pdf</a>
评论 #42998986 未加载
评论 #42997173 未加载
LeoPanthera4 个月前
No login required: <a href="https:&#x2F;&#x2F;nitter.net&#x2F;abhi9u&#x2F;status&#x2F;1887010136155414602" rel="nofollow">https:&#x2F;&#x2F;nitter.net&#x2F;abhi9u&#x2F;status&#x2F;1887010136155414602</a>
npalli4 个月前
Wow, Doug McIlroy is 92 and still sharp, working through the implementation. Too many times, people complain of being over the hill when they hit 30!. Mental capability doesn&#x27;t slow that dramatically (as physical) good perspective to have.<p><a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Douglas_McIlroy" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Douglas_McIlroy</a>
评论 #43003786 未加载