I read an article where a software engineer was about to go on a long plane flight, so he downloaded a file of all English words and tried to write a spell checker during his flight, with no internet access, just as an exercise. And when he landed, he had finished, and his algorithm was beautiful, simple, elegant, easy to understand, and never something I'd have come up with in a million years.<p>It was actually a little depressing - it made me realize that some people just have a "something" that I'm not going to get through hard work or whatnot.<p>*edit: ah, found it. <a href="https://norvig.com/spell-correct.html" rel="nofollow">https://norvig.com/spell-correct.html</a>
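For anyone who hasn't clicked through: the heart of Norvig's corrector is generating every string one edit away from the typo and keeping the ones that are real words, ranked by corpus frequency. A condensed sketch of that idea (the inline corpus here is a toy stand-in for his `big.txt`):

```python
import re
from collections import Counter

# Toy corpus standing in for Norvig's big.txt; word counts drive ranking.
CORPUS = "the quick brown fox jumped over the lazy dog the fox"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correct(word):
    """Most frequent known candidate, preferring fewer edits."""
    candidates = (known([word]) or known(edits1(word))
                  or {e2 for e1 in edits1(word) for e2 in known(edits1(e1))}
                  or {word})
    return max(candidates, key=lambda w: WORDS[w])
```

With a real corpus this is essentially the whole program, which is what makes it so humbling.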
No one posted Martha Snow's epic poem yet?<p>Eye halve a spelling chequer
It came with my pea sea
It plainly marques four my revue
Miss steaks eye kin knot sea.<p>Eye strike a quay and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me strait a weigh.<p>As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
It's rare lea ever wrong.<p>Eye have run this poem threw it
I am shore your pleased two no
It's letter perfect awl the weigh
My chequer tolled me sew.
It still blows my mind that when growing up I had a Commodore 64 word processor with spell check. To run the spell checker you had to hit the button to start it and then turn the floppy upside down while it ran through the document.<p>This means you had to fit into a measly 64kb the code for the spell checker, the entire contents of the document, KERNAL overhead, and enough working data for the spell checker to run. Remember that the Commodore 64 had an incredibly slow disk drive (300bps) and the disks only supported 160kb of data. Any scheme that required a lot of disk access would be terribly slow. Yet somehow it worked and wasn't outrageously slow. I suspect the program was unloading big chunks of its own code to aggressively reuse the memory space, but even then it seems like magic.
If you think this is great and/or immediately thought of using a trie to reduce storage costs, then let me introduce you to <a href="https://blog.burntsushi.net/transducers/" rel="nofollow">https://blog.burntsushi.net/transducers/</a>, which pushes storage reuse to the point of almost competing with compression like gzip while still letting you consider edit distance when finding possible alternatives.<p>Typeahead/autocomplete and spellchecking both need this, so it's worth a read.
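For a flavor of the simpler end of this design space (not the FST transducers from the linked post, which additionally share suffixes): a common trick is to walk a trie while carrying a Levenshtein DP row, pruning any branch that can no longer stay within the edit budget. A rough sketch:

```python
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def search(root, query, max_dist):
    """Return {word: distance} for all words within max_dist of query."""
    results = {}
    first_row = list(range(len(query) + 1))  # distance from empty prefix

    def recurse(node, ch, prev_row, prefix):
        row = [prev_row[0] + 1]
        for i, qc in enumerate(query, start=1):
            row.append(min(row[i - 1] + 1,               # insertion
                           prev_row[i] + 1,              # deletion
                           prev_row[i - 1] + (qc != ch)))  # substitution
        if node.is_word and row[-1] <= max_dist:
            results[prefix] = row[-1]
        if min(row) <= max_dist:  # prune branches that can't recover
            for c, child in node.children.items():
                recurse(child, c, row, prefix + c)

    for c, child in root.children.items():
        recurse(child, c, first_row, c)
    return results
```

Because shared prefixes share DP rows, this already beats checking every dictionary word independently; the transducer approach goes much further on storage.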
It used to be a popular submission topic too (still is, and used to as well):<p><i>A spellchecker used to be a major feat of software engineering (2008)</i> - <a href="https://news.ycombinator.com/item?id=25296900" rel="nofollow">https://news.ycombinator.com/item?id=25296900</a> - Dec 2020 (143 comments)<p><i>A Spellchecker Used to Be a Major Feat of Software Engineering (2008)</i> - <a href="https://news.ycombinator.com/item?id=10789019" rel="nofollow">https://news.ycombinator.com/item?id=10789019</a> - Dec 2015 (29 comments)<p><i>A Spellchecker Used to Be a Major Feat of Software Engineering</i> - <a href="https://news.ycombinator.com/item?id=4640658" rel="nofollow">https://news.ycombinator.com/item?id=4640658</a> - Oct 2012 (70 comments)<p><i>A Spellchecker Used To Be A Major Feat of Software Engineering</i> - <a href="https://news.ycombinator.com/item?id=3466388" rel="nofollow">https://news.ycombinator.com/item?id=3466388</a> - Jan 2012 (61 comments)<p><i>A Spellchecker Used to Be a Major Feat of Software Engineering</i> - <a href="https://news.ycombinator.com/item?id=212221" rel="nofollow">https://news.ycombinator.com/item?id=212221</a> - June 2008 (22 comments)
Three other things to think about:<p>1980s memory models, especially on DOS-era PCs, were a nightmare, though with great effort you could work around them. The C / Unix memory model is actually quite luxurious compared to programming on DOS 3.3.<p>In the old days non-English support was pretty complicated. Now you rely on the OS to support UTF-8 and call it good, more or less. It was a bigger problem 'back in the day'.<p>In the old days you'd run a spell checker as a batch process, possibly not even part of the editor itself. The point I'm making is WRT speed: nowadays speed doesn't matter, not because processors are faster (although they are; software always expands faster than hardware) but because we have integrated apps with threading. Instead of saving this into a file, exiting my browser, running a separate spell check app, then restarting the browser and loading this file back into it, the browser simply has a thread doing spell checking "on the fly". So in the old days it was VERY important WRT labor costs to spell check a 10000 word essay as fast as possible, and on an 8 mhz XT that was a trick indeed. But today my multi core multi ghz desktop merely needs to spell check on the fly as I'm typing, and I can't type faster than 100 wpm or so. My desktop has "like one second or so" to decide that ghz and mhz are not words and put red squigglies under them, whereas in the old days you'd have to spell check like 1000 words/second to keep the users happy.<p>So simple ineffective code survives in 2023 because it can lean on the OS; it's not just that hardware is faster or larger.
A good one still is. When I got my most recent phone (Galaxy 22, but I jumped several versions in the upgrade so I don't know when the change happened), the spellcheck engine regressed massively from previous editions.<p>I'd be so curious to hear what the change was. It's absolutely awful to use!
He's got a point: increases in minimum memory and CPU performance mean that superior programming isn't always necessary.<p>People regularly produce for fun what used to be a master's thesis level of work these days. I'll cite that YouTube video of someone making a reaction wheel system that can get an inverted Lego pendulum to stand up, and stay up under perturbations. The Lego part isn't the main work. The main work is software: a PID controller for the reaction wheel.<p>Something else has happened. There are enough people who know enough that the knowledge is just sloshing around, waiting for others to do cool things with it.
The really mind-blowing thing about that 2 meg dictionary is that it can completely fit in cache on many computers nowadays. Just mind-blowing that an optimized linear scan today is competitive with the faster methods of yesteryear. (Heck, even an unoptimized scan is probably more than competitive.)<p>This does have me curious how big the ZDD for the dictionary would be. <a href="https://en.wikipedia.org/wiki/Zero-suppressed_decision_diagram#ZDDs_as_dictionaries" rel="nofollow">https://en.wikipedia.org/wiki/Zero-suppressed_decision_diagr...</a>
It's kind of amazing how many hard computer problems from the olden days were solved by having more memory. When I got started in computer graphics, there was a lot of research work on how to render images a scanline at a time, since that's all the memory you had. Once you could get enough memory to hold the entire image, you could use a Z-buffer and rendering became kind of trivial.<p>Another example is the post on HN yesterday about old videogames and how they used a bunch of different hacks such as sprites and tiles because there wasn't enough memory for the whole screen.
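The Z-buffer really is as small as the parent suggests once you can afford a full-resolution depth array; a toy sketch (sizes and names are illustrative, not from any real renderer):

```python
# Toy Z-buffer: with memory for a per-pixel depth array, hidden-surface
# removal collapses to a single comparison per pixel write.
W, H = 4, 4
FAR = float("inf")

color = [["." for _ in range(W)] for _ in range(H)]
depth = [[FAR for _ in range(W)] for _ in range(H)]

def plot(x, y, z, c):
    """Draw the pixel only if it is closer than what is already there."""
    if z < depth[y][x]:
        depth[y][x] = z
        color[y][x] = c

plot(1, 1, 5.0, "A")   # far surface drawn first
plot(1, 1, 2.0, "B")   # nearer surface overwrites it
plot(1, 1, 9.0, "C")   # even farther surface is discarded
```

No sorting, no scanline bookkeeping: draw triangles in any order and the buffer resolves visibility.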
<a href="https://nicole.express/2023/yes-popeye-the-sailor-man.html" rel="nofollow">https://nicole.express/2023/yes-popeye-the-sailor-man.html</a>
...but a really good grammar checker is still, to this day, a major feat.<p>Even Grammarly can't write sentences for you as well as a native speaker. ChatGPT and language models can, but ironically, they still require way more memory than fits on a desktop. Simple cases like conjugation can certainly be handled, but there is a lot of nuance to making a sentence "flow".
The “current hotness” for spellcheck would be to use a large language model (LLM) like GPT-3 that can use several paragraphs of context to disambiguate between alternatives.<p>However, usefully accurate LLMs are still too big to run locally on a laptop or PC, so once again we’re back to having to liberally apply black magic to squeeze the models into something an integrated GPU can run in its “tiny” VRAM that’s actually gigabytes in size.<p>The more things change, the more they stay the same.<p>PS: Spell check is a trivially solved problem for most languages, but not all! Hungarian is notoriously difficult, and there are ongoing projects to improve this: <a href="https://en.m.wikipedia.org/wiki/Hunspell" rel="nofollow">https://en.m.wikipedia.org/wiki/Hunspell</a>
In some ways, I think computational linguistics has missed a mark. We have dictionaries, lexicons, grammar engines, spell checkers, pluralization rules, tries, quotation disambiguation, sentiment analyzers, word segmentation, and more. You'd think we could roll all these into a unified, standard definition format.<p>My KeenWrite editor, for instance, uses:<p>* <a href="https://github.com/DaveJarvis/KeenSpell">https://github.com/DaveJarvis/KeenSpell</a> (spell check, lexicon)<p>* <a href="https://github.com/DaveJarvis/KeenQuotes">https://github.com/DaveJarvis/KeenQuotes</a> (curls straight quotes)<p>* <a href="https://github.com/DaveJarvis/KeenWrite/blob/main/R/pluralize.R">https://github.com/DaveJarvis/KeenWrite/blob/main/R/pluraliz...</a> (pluralization)<p>* <a href="https://lucene.apache.org/" rel="nofollow">https://lucene.apache.org/</a> (word segmentation)<p>I was looking at integrating LanguageTool[0] for grammar and realized that it has partial functionality for KeenQuotes (lexing and tokenization), duplicates the SymSpell algorithm used by KeenSpell, and because it offers grammar corrections it likely can pluralize words, as well.<p>Unifying those for English alone would be a massive undertaking. At present, KeenWrite parses prose in multiple, redundant passes. While this works, I think it'd be great to start with a basic lexer/tokenizer that can emit tokens at blazing speeds[1], which feeds into higher-level abstractions.<p>[0]: <a href="https://github.com/languagetool-org/languagetool">https://github.com/languagetool-org/languagetool</a><p>[1]: <a href="https://github.com/DaveJarvis/KeenQuotes/blob/main/src/main/java/com/whitemagicsoftware/keenquotes/lex/Lexer.java">https://github.com/DaveJarvis/KeenQuotes/blob/main/src/main/...</a>
And still is for most. I'm a horrible speller, but fairly consistent in my mistakes: usually I use the wrong vowel with a similar sound in the middle of a word. It's amazing how few spell checkers can handle that. The Google search bar is the undisputed king of spell checking in my experience. Microsoft Word, even from way back, has been good. Thunderbird from about a decade ago was the most infuriatingly bad spellchecker I have ever encountered. I always recognize the correct spelling when I see it. Yet Thunderbird would commonly refuse to suggest the word I was looking for but would often give me the "un" version of that word. I mean, come on. I can get the first letter right 9,999 times out of 10,000, and that is probably a gross underestimation. And the few I would get wrong are a handful of edge cases...
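Wrong-vowel-similar-sound errors are exactly what classic phonetic codes were built for: Soundex, for instance, discards vowels entirely, so "seperate" and "separate" get the same code. A simplified sketch (this is the textbook algorithm, not any particular product's implementation, and it skips some edge-case rules of full Soundex):

```python
def soundex(word):
    """Simplified Soundex: same code for words differing mainly in vowels."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4",
             **dict.fromkeys("mn", "5"),
             "r": "6"}
    word = word.lower()
    out = word[0].upper()              # keep the first letter verbatim
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:      # collapse adjacent duplicate codes
            out += code
        if ch not in "hw":             # vowels (and y) reset the duplicate check
            prev = code
    return (out + "000")[:4]           # pad/truncate to letter + 3 digits
```

A spellchecker can index its dictionary by phonetic code and suggest any word sharing the typo's code, which handles exactly the consistent-vowel-swap case.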
Random story. I was a young kid in the 90's and my dad took me by the biggest house I'd seen at that point in my life for some work party or something. Really nice landscaping, etc. It was clear the owner had A Lot Of Money.<p>I asked him afterwards how the guy got so rich and he told me that he wrote Microsoft's spellchecker.
Actually there's way more to it than loading /usr/share/dict/words into a perl hashtable and calling it done. A good spellchecker from scratch is still a massive engineering effort in 2023.
Now it’s just a major pain in my butt. At least the ones on phones. Half the time they replace a legitimate word with the one they thought I actually meant. They’re almost always wrong.
I dunno, there was an article in <i>Creative Computing</i> magazine circa 1984 about how to code one up in BASIC that required a little creativity but made it look pretty easy. If you sort the words in the document, it is pretty quick to check them against a dictionary on disk.<p>Turbo Lightning, though, took advantage of the comparably large memory of the IBM PC to do something that really blew people away:<p><a href="https://books.google.com/books?id=MC8EAAAAMBAJ&pg=PA7&lpg=PA7&dq=turbo+lightning+borland&source=bl&ots=xO6qUGfCFQ&sig=ACfU3U1i3tpDfjD6NaefKC_HEVwEpNBvhA&hl=en&sa=X&ved=2ahUKEwin8aCn67j9AhWIkIkEHS70ArEQ6AF6BAgyEAM#v=onepage&q=turbo%20lightning%20borland&f=false" rel="nofollow">https://books.google.com/books?id=MC8EAAAAMBAJ&pg=PA7&lpg=PA...</a>
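The sort-then-merge approach described above fits in a few lines: sort the document's unique words, then advance once through the (already sorted) dictionary stream, so the dictionary never has to fit in memory. A sketch of the idea (function and variable names are mine):

```python
def misspelled(doc_words, dictionary_lines):
    """Merge sorted unique document words against a sorted dictionary stream."""
    words = sorted(set(w.lower() for w in doc_words))
    missing = []
    dict_iter = iter(dictionary_lines)   # e.g. lines read from a file on disk
    entry = next(dict_iter, None)
    for w in words:
        while entry is not None and entry < w:
            entry = next(dict_iter, None)  # advance the stream to catch up
        if entry != w:
            missing.append(w)              # word never appeared in dictionary
    return missing
```

Because both sequences are sorted, the whole check is one sequential pass over the dictionary, which is exactly what a slow 1980s floppy drive needed.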
Yet in 2022 spell checking is still not good, at least when it comes to ordinary software. It checks with dictionaries, it seems that recently Google Chrome became a little bit better at it, but still I can completely write gibberish random with order words of and it wud no car as long as separate words are present in the dictionary. And that's while there's AI which generates awesome texts around.<p>I don't know English well, but one single thing that immensely helps me memorizing English is instant feedback when I'm writing texts with errors. I bet that good spell checker in Chrome would do much more to the world than many English courses (also helps with any other language, of course, but English is special as it's world's lingua franca nowadays).
This reminded me of a Mr Wizard's World episode I watched in the 80s: <a href="https://www.youtube.com/watch?v=pdz_8AjIgfA">https://www.youtube.com/watch?v=pdz_8AjIgfA</a><p>In it he demos the UI of a word processor of that era on a green screen monitor. As I had a CGA screen/PC at the time word processing was pretty novel to me ... in white and black, but most of my friends did not own computers and those that did had Commodore 64s[0].<p>[0] I sat jealous despite sitting before a $3,000 8088 w/FPU, 10MB HDD and 640k of RAM that took 2 minutes before it started booting from the hard drive (I learned to count in powers of two up to 640 that way).
From Jon Bentley, the history of Unix spell checking [PDF]:<p><a href="https://dl.acm.org/doi/pdf/10.1145/3532.315102" rel="nofollow">https://dl.acm.org/doi/pdf/10.1145/3532.315102</a>
As a person with dysgraphia I can assure you that it still is a major feet. The algorithms are fast, but they still do a bad job of turning the semi-random sequence of letters I come up with into real words.
Here's a play-by-play of a software engineer at Atari building a spellchecker for VAX mainframes in the 80's:<p><a href="https://atariemailarchive.org/thread/on-building-a-spellchecker-in-the-80-s-21" rel="nofollow">https://atariemailarchive.org/thread/on-building-a-spellchec...</a><p>It's a game of problem whack-a-mole in this email thread!
There are still "hard" problems to solve when it comes to spellchecking. For example predictive text and correct typo detection. Notice what suggestions the Firefox and chrome spellcheckers give you versus what happens when you use google or mobile phone autocorrect. There are still differences in spellchecking speed, quality, and UX.
It's still pretty bad in some fairly common languages like French. For instance, the apostrophe ("'") really trips up spellcheckers, as do some tenses (the subjunctive). For instance "que je n'embête".
I remember that the early CP/M edition of WordStar didn't have an interactive spell-checker, but you could run it in batch mode and get a report.<p>One way to implement that is to store, say, a paragraph or more of your document at a time in RAM, then do a sequential read of the dictionary file on disk, compare against the stored paragraph's words, and save any non-finds to the misspelling report.<p>Saving the positions would let you bring the editor back up with the misspelled words highlighted, which is what WordStar did as I remember it.
I spent some years working as a freelance proofreader and copyeditor, and came to regard writers' use of spellcheckers as basically a guarantee of my future employment.
Mildly funny anecdote: An early version of WordPerfect spell check insisted on turning the company name "Unisys" into "anuses". That was a good laugh.
In 1988, I had a friend in my computer science program who got some local notoriety when he made a spell checker for DOS. It was a terminate-and-stay-resident (TSR) program and it used a tiny bit of the available 640K of RAM.<p>We were all pretty amazed by it. He got some seed money to start a business and then was crushed like a year later when it was just built into WordStar.
You can still try your skillz here: <a href="https://open.kattis.com/problems/dictionary" rel="nofollow">https://open.kattis.com/problems/dictionary</a><p>It's probably easy mode compared with the original constraints in the article, but the problem is still somewhat challenging. It took me a sleepless night to get it right.
Perhaps that’s a good thing. There’s so much computing hardware these days that some things are basically “free”. The liberating (but also scary) implication is that we’re limited more by our ideas than by the hardware, so go out there and create something cool; there’s little stopping you other than the ideas in your head.
This reminds me of an article on the JoS blog where Joel describes how hard/impossible it is to do real-time spell checking on the web with red underlines on the typos.<p>It really is mind blowing for those of us who remember spacer.gif and are still hacking today. The web has really grown up.
Word processors in the 70s had this as well. I remember the CPT 8100 having this as a selling point in their literature. There was no inline correction. Run the program, correct your issues then continue!<p>Anyone else remember Shift+F7? Bring out your Wordperfect templates.
I am dyslexic and my spelling is completely garbage without spellcheck. Even so, spellcheck cannot recognize many of my misspellings.<p>I have always wondered if there is a way to recognize the word I am trying to spell by its shape (form) and length.
Spellcheckers still blow my mind, and so does Gmail's autocomplete (still the best out there imo). I still marvel when they work, even though I know how they work and have used them for a while.