You can write an external memory spell checker with a tiny amount of RAM: something like<p><pre><code> - sort the words in the document
 - collapse duplicate words (they sort together)
- merge the sorted words with the sorted dictionary and keep only the missing words
</code></pre>
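Something like this, as a rough sketch in Python (mine, not the original BASIC; it assumes a document.txt and a lowercased, sorted word list in dict.sorted, and the in-memory sort stands in for the external sort you'd do with so little RAM):<p><pre><code>import re

def document_words(path):
    with open(path) as f:
        for line in f:
            yield from re.findall(r"[a-z]+", line.lower())

# Steps 1 and 2: sort the document's words and collapse duplicates.
doc = sorted(set(document_words("document.txt")))

# Step 3: stream through the sorted dictionary in parallel, printing any
# document word the dictionary skips past (a possible misspelling).
with open("dict.sorted") as f:
    dict_words = (line.strip() for line in f)
    entry = next(dict_words, None)
    for word in doc:
        while entry is not None and entry < word:
            entry = next(dict_words, None)
        if entry != word:
            print(word)
</code></pre><p>The merge only ever holds the current word from each stream, which is why the approach needs so little memory.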
I saw this in BASIC in <i>Creative Computing</i> and got it working on my TRS-80 Color Computer, which had much less than 32k of available RAM, so that was the first thing I thought of when I saw the headline.<p>Now this blew people away when it came out:<p><a href="https://winworldpc.com/product/turbo-lightning/1x" rel="nofollow">https://winworldpc.com/product/turbo-lightning/1x</a><p>It had a compressed dictionary that fit in memory alongside the other programs you were running on a PC and spell checked <i>as you typed</i>. There was a 640k limit for the PC, but it could only use a fraction of that so as not to interfere with other software, and in the early days of the PC you couldn't actually afford to fill it all out anyway.
> At this time, Bloom filter was not even called Bloom filter. In his paper, Douglas calls it a “superimposed code scheme”.<p>A Bloom filter is a specific type of superimposed code.<p>Calvin Mooers developed random (1) superimposed coding in his Master's thesis at MIT back in the 1940s, directly influenced by Shannon's work.<p>Bourne's superb 1963 book "Methods of Information Handling" gives details of the mathematics.<p>I've no doubt Douglas knew about the broader technique, which, for example, the author of "The Large Data Base File Structure Dilemma" (1975) at <a href="http://dx.doi.org/10.1021/ci60001a005" rel="nofollow">http://dx.doi.org/10.1021/ci60001a005</a> described as "an old technique called super-imposed coding".<p>(1) The "random" is an important qualifier because there were superimposed codes predating Mooers, but they were not mathematically interesting or all that practically important.
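To make "superimposed code" concrete, here is a minimal Bloom-filter sketch in Python (my illustration with arbitrary sizes, not Douglas's scheme): each word sets K pseudo-random bit positions, and every word's code is OR-ed, i.e. superimposed, onto one shared bit array.<p><pre><code>import hashlib

M = 2 ** 20    # bits in the filter (arbitrary illustrative size)
K = 7          # bit positions set per word (arbitrary)

def positions(word):
    # K pseudo-random bit positions derived from the word.
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M

def build(words):
    filt = bytearray(M // 8)
    for w in words:
        for p in positions(w):
            filt[p // 8] |= 1 << (p % 8)   # superimpose this word's code
    return filt

def maybe_contains(filt, word):
    # No false negatives; false positives occur if all K bits collide.
    return all(filt[p // 8] & (1 << (p % 8)) for p in positions(word))

filt = build(["receive", "separate", "dictionary"])
print(maybe_contains(filt, "receive"))   # True
print(maybe_contains(filt, "recieve"))   # almost certainly False
</code></pre><p>Lookups can return false positives when all of a word's bits happen to have been set by other words; that is the error rate the spell checker traded for memory.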
Way too smart for worse is better. Think Worser!<p>The main memory bandwidth and the disk bandwidth were about the same, a little over 1MB/s.<p>I would have done this in multiple passes (but still used the Bloom filters, those are cool).<p><a href="https://github.com/arnoldrobbins/v10spell">https://github.com/arnoldrobbins/v10spell</a><p><a href="https://code.google.com/archive/p/unix-spell/" rel="nofollow">https://code.google.com/archive/p/unix-spell/</a><p>The original paper is great: <a href="https://www.semanticscholar.org/paper/Development-of-a-Spelling-List-McIlroy/e08c8a4c17f23c41616649ca73a908d06828d67f" rel="nofollow">https://www.semanticscholar.org/paper/Development-of-a-Spell...</a><p>It is hosted on his web page: <a href="https://www.cs.dartmouth.edu/~doug/" rel="nofollow">https://www.cs.dartmouth.edu/~doug/</a><p><a href="https://en.wikipedia.org/wiki/Douglas_McIlroy" rel="nofollow">https://en.wikipedia.org/wiki/Douglas_McIlroy</a><p>If you are a word nerd, you will have found obovate and, from there, this chart.<p><a href="https://upload.wikimedia.org/wikipedia/commons/e/e8/Leaf_morphology.svg" rel="nofollow">https://upload.wikimedia.org/wikipedia/commons/e/e8/Leaf_mor...</a>
I can't remember the name of the product but in the 80s there was a <i>hardware</i> spell checker for the IBM PC. It was a box that connected between your keyboard and your PC, and if you ever typed a string of letters that it did not recognize as a dictionary word, it would beep to let you know.
One of the things that got me intrigued by Unix was an early 1980s(ish) Byte article which walked through building a (trivial example, not the "real" one) spell checker out of a split/sort/comm pipeline, something like 7 commands? 8-bit PCs didn't have anything like that, and yet it didn't look like it needed <i>that</i> much sophistication...
Just finished reading the article finally (thanks!). The crux of it for me was:<p>- They had a "dictionary" of 30000 words, and accepting a ~1/4000 rate of false positives meant that if they hashed each word to a 27-bit string (integer), they could throw away the dictionary and the problem reduces to storing a set of 30000 27-bit strings.<p>- Somewhat surprisingly, information theory tells us that 30000 27-bit strings can be stored using not 27 but just ~13.57 bits per word. I understand the math (it's straightforward: <a href="https://www.wolframalpha.com/input?i=log_2%282%5E27+choose+30000%29%2F30000" rel="nofollow">https://www.wolframalpha.com/input?i=log_2%282%5E27+choose+3...</a> ) but it will probably take me a while to stop finding this counterintuitive, as 30000 is so small compared to 2^27 (which is ~134 million) that it is hard to see where the gains come from.<p>- To encode this 30000-sized subset of 27-bit hashes, they used hash differences, which turn out to be geometrically distributed, and a coding scheme tuned for geometrically distributed input (Golomb coding), to actually achieve ~13.6 bits per word.<p>I've tried to think of how one could do better, even in principle and with infinite time, along the lines of “perfect hashing” — maybe there should be a function that will take an alphabetic word, do some transformations on it, and the resulting hash will be easy to verify for being in the good set vs not. But thinking about it a bit more, the fact that we need that false-positive rate (non-dictionary words shouldn't get mapped to anything in the "good" set) requires us to use at least 27 bits for the hash. What they did seems basically theoretically optimal? Or can there exist a way to map each word to a 27-bit integer, such that the good strings are those with values less than 30000, say?
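If anyone wants to poke at those numbers, here is a small Python sketch of the arithmetic (my own illustration, not the paper's code; uniform random 27-bit values stand in for hashed words):<p><pre><code>import random
from math import comb, log2, e

N = 30_000     # dictionary size
B = 27         # hash width in bits
U = 2 ** B     # hash space, ~134 million values

# Information-theoretic bound: bits per word to record which N of the U
# possible hash values occur.
print(f"bound:       {log2(comb(U, N)) / N:.2f} bits/word")    # ~13.57

# Where the gain comes from: the gaps between sorted hashes average U/N
# (~4474), and a geometric gap with that mean carries only about
# log2(U/N) + log2(e) bits of entropy, not 27.
print(f"gap entropy: {log2(U / N) + log2(e):.2f} bits/word")   # ~13.57

# Rice coding (Golomb with a power-of-two parameter) of simulated gaps:
hashes = sorted(random.sample(range(U), N))
gaps = [b - a for a, b in zip([0] + hashes, hashes)]
k = round(log2(0.69 * U / N))                # parameter near the Golomb optimum
bits = sum((g >> k) + 1 + k for g in gaps)   # unary quotient, then k-bit remainder
print(f"Rice-coded:  {bits / N:.2f} bits/word")                # ~13.7
</code></pre><p>One way to see where the gain comes from: a <i>list</i> of 30000 hashes really does need 27 bits each, but a <i>set</i> doesn't care about order, and the ordering information is log2(30000!) ≈ 13.4 bits per word, leaving roughly 13.6.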
For perspective, in 1983 or so, Grammatik on CP/M ran in under 64k and did "grammar checking" (spell checking, plus a bunch of expert system rules) on an 8-bit system. (It sticks in my memory because of the time spent poking at the <i>really</i> interesting part: that it was so compact because it was in Forth, and there was enough of an outer interpreter in the product that with a little hex editing you could just use it as a Forth interpreter - with a <i>very</i> specialized set of functions preloaded :-)
I wonder what common typos this misses thanks to the hashing?<p>Related, a contest about compressing the Wordle dictionary: <a href="http://golf.horse/wordle/" rel="nofollow">http://golf.horse/wordle/</a>
In the mid '80s I ran into something similar. Fast is relative.<p>I had a lot of data, 640KB of RAM, 64KB of heap, and 64KB of stack. I had hundreds of megabytes that I had to search, extract data from, and then combine some of it.<p>I experimented with a data index structured as ternary trees. Conceptually it made sense, but implementation-wise the relationships and paths were still too big to keep in 64KB.<p>Instead of compression, I did swapping. I wrote a TSR [0] (think: a background service), a piece of code that would process a chunk of the data, extract the results, store them on the stack, dump the original data, and make an interrupt call to the TSR, which would in turn clear the heap, read in the next chunk from storage, and return control to the program to process it, combine it with the stack data, and continue until the entire job was finished.<p>Originally this process took about a week for three data entry people (think about a dozen 3" ring binders filled with tables) and a specialist combining the information. The program completed the work in just a few hours. It was amazingly "fast".<p>This was on a single-threaded system.<p>[0] <a href="https://en.wikipedia.org/wiki/Terminate-and-stay-resident_program" rel="nofollow">https://en.wikipedia.org/wiki/Terminate-and-stay-resident_pr...</a>
I can remember using UNIX spell with the '-b' option, because I am British. There were only two language options, and now I want to know what the decision making was behind that, how the code catered for that and where the respective dictionaries came from. Did Australians and New Zealanders use British spelling or American?<p>UNIX spell was the 'ZX81 1K chess' of spelling, and, on home computers, we did not have a lot of spell checking going on until MS Word for Windows 3.1. Before then, in offices, secretaries did the typing with WordPerfect. They were human spell checkers for their respective managers and teams.<p>Meanwhile, at home, with our dot matrix printers and flickery screens, we were winging it with paper dictionaries for all of those early years of computing. I can't remember spell checking as being that important back then as everyone could spell. I was in a school of a thousand and there was only one kid that claimed to be dyslexic, a plausible excuse for not being able to spell. Maybe the 1980s was literacy's golden age with there being a clear start date for the decline in our spelling ability, that being the day UNIX spell was written.<p>I like to play Scrabble. Although a very different problem to spell checking, the process shares some steps with UNIX spell. Common word prefixes and suffixes are identified and bolted together in the rack or on the board with other components. Then a Scrabble dictionary is a bit like UNIX spell as it is just a big dictionary of words with no meanings provided. All that matters is whether a given word is in the book or not. It also has a few special look up tables such as the 102 two letter words.
Reading about this and similar techniques in Programming Pearls (Second Edition) by Jon Bentley left the younger me spellbound. Similar to the evolution of linkers up to mold.
How about 39kB for a video game with physics, dynamic graphics, two music tracks, sound effects, online high scores, and built-in instructions? <a href="https://news.ycombinator.com/item?id=38372936">https://news.ycombinator.com/item?id=38372936</a>
This spell check must have come later. The original spell check burst the words of the document, and sort -u'd them:<p><pre><code>makewords sentence | lowercase | sort | unique | mismatch</code></pre>
"In order to secure funding for Unix, Ken Thompson and Dennis Ritchie pitched Unix as a text processing system for the patents department to AT&T. Naturally, a text processing system needed a spell checker as well."<p>I still use UNIX every day primarily for text processing.
Ah, the Good Old Days. Back when a bit of memory cost more than most butter churns, and the ultimate data transfer technology was a sharp Express rider with his saddlebags full of punch cards...<p>Back in the early <i>19</i>80's, I recall futzing around a bit with dictionary compression algorithms. Very interesting - but it didn't take me long to conclude that outperforming `spell` wouldn't be easy. Nor worth the effort.
> "...developed a compression algorithm that came within 0.03 bits of the theoretical limit of possible compression."<p>Since when can we subdivide <i>bits</i>!?
> But 27-bit hash codes were too big: with 2^15 words, they needed 2^15 * 27 bits of memory, while the PDP-11 had only 2^15 * 16 bits (64kB) of RAM—compression was essential.<p>I'm frustrated when people put this kind of typography on the web. HTML can do superscript.