Even though this happened a long time ago, whenever I hear or think about it, I am amazed that it didn't put Xerox out of business, or at least hurt it a little more. After all, some big players were already doing digital archiving at the time. :-/ BTW, the CCC had a pretty neat presentation at that time as well: <a href="https://www.youtube.com/watch?v=c0O6UXrOZJo">https://www.youtube.com/watch?v=c0O6UXrOZJo</a>
This can happen in some compression modes of DjVu as well at high compression factors, where the background and foreground are separated and the foreground (text, usually) is split into glyphs that can be shared by different instances. Mess up the recognition and the letters on the page appear literally different in the compressed "output".
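To make the glyph-sharing failure concrete, here is a toy sketch in Python. It is not how DjVu or JBIG2 actually encode pages: the tiny bitmaps, the `hamming` distance, and the `max_diff` threshold are all assumptions made up for illustration. The idea is only that a matcher which tolerates "close enough" glyphs can file an 8 under the dictionary entry for a 6, and the decoder then faithfully draws a crisp 6.

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]  # tiny bitmap, one string per row, '#' = ink

def hamming(a: Glyph, b: Glyph) -> int:
    """Count pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def encode(glyphs: List[Glyph], max_diff: int) -> Tuple[List[Glyph], List[int]]:
    """Replace each glyph by a reference into a shared dictionary.

    A glyph reuses the first dictionary entry within max_diff pixels;
    otherwise it is added as a new entry. max_diff=0 is lossless.
    """
    dictionary: List[Glyph] = []
    refs: List[int] = []
    for g in glyphs:
        for i, entry in enumerate(dictionary):
            if hamming(g, entry) <= max_diff:
                refs.append(i)
                break
        else:
            dictionary.append(g)
            refs.append(len(dictionary) - 1)
    return dictionary, refs

def decode(dictionary: List[Glyph], refs: List[int]) -> List[Glyph]:
    """Rebuild the page by looking up every reference."""
    return [dictionary[i] for i in refs]

# Hypothetical 4x5 bitmaps that differ in only two pixels.
SIX: Glyph = (
    ".##.",
    "#...",
    "###.",
    "#..#",
    ".##.",
)
EIGHT: Glyph = (
    ".##.",
    "#..#",
    ".##.",
    "#..#",
    ".##.",
)

page = [SIX, EIGHT, SIX]

# Exact matching keeps both glyphs, so the page round-trips.
lossless = decode(*encode(page, max_diff=0))

# A tolerant matcher merges the 8 into the 6's dictionary entry.
lossy = decode(*encode(page, max_diff=2))
```

With `max_diff=0` the page comes back unchanged. With `max_diff=2` the middle glyph comes back as a clean 6, with no blur or artifact to hint that anything was substituted.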
Using OCRmyPDF, I applied lossless JBIG2 compression to a scanned book, after some consideration.<p>* the OCRmyPDF docs point to the JBIG2 Wikipedia page, and the Disadvantages section - <a href="https://en.wikipedia.org/wiki/JBIG2#Disadvantages" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/JBIG2#Disadvantages</a> - so it's easier to avoid this bug<p>* I'd hoped OCR would fall out of the process, but nope<p>* from the Wikipedia page, huh, the Pegasus malware exploited iOS's implementation of JBIG2
One of the more amusing parts of this blog is that it contains at least two typos: "arrors" and "ancoding" and I can't tell if they were on purpose.
I remember this talk quite well. The other talks by David are interesting too.<p>Somehow there were no real consequences from this. So either archiving stuff, at least in the business context, is not really important, or we simply trust these copies. The latter is of course scary.
Is this a Xerox-specific issue or an industry-wide problem? The author suspected that this is not an OCR issue. What about other scanners? HP, Epson, Canon, Ricoh, etc.?
Please add (2913) to the title:<p><a href="https://hn.algolia.com/?query=Xerox%20scanners%20randomly%20alter%20numbers%20in%20scanned%20documents&type=story&dateRange=all&sort=byDate&storyText=false&prefix&page=0" rel="nofollow noreferrer">https://hn.algolia.com/?query=Xerox%20scanners%20randomly%20...</a><p>Edit: oops, 2013 - typed on my smartphone without reading back :)