Sometimes I think archivists are so obsessed with getting perfect scans, where every pixel is precious, that scanning books becomes too costly and so never happens.<p>A simple alternative is to collect some volunteers with iPhones and have one person turn pages while the other clicks the shutter. You could easily do 20 pages a minute, which is 1,200 an hour, or roughly 10,000 in an eight-hour day. I bet those acres of books could be ground through in reasonably good time.<p>Of course, the images would horrify an archivist. But try it yourself with a random book: the results are quite serviceable. At the very least, one then has a backup in case of a catastrophe at the library.<p>OCRing them is an entirely separate issue.
Very interesting. Way back, my old university was also involved in historical document processing: <a href="https://www.rug.nl/research/portal/files/40224455/Chapter_7.pdf" rel="nofollow">https://www.rug.nl/research/portal/files/40224455/Chapter_7....</a>. They also looked at things like writer identification and automatically dating the documents using a wide array of hand-crafted features. I'm curious what would happen with some of the newer deep learning models, but the project has been dead for a while: <a href="http://application02.target.rug.nl/cgi-bin/monkweb?db=All&cmd=scroogle" rel="nofollow">http://application02.target.rug.nl/cgi-bin/monkweb?db=All&cm...</a> … as these things go.
I found this part interesting:<p>> In texts transcribed so far, a full one-third of the words contained one or more typos, places where the OCR guessed the wrong letter. [...] Still, the software got 96 percent of all handwritten letters correct.<p>96% correct sounds pretty good, but at one wrong letter in every 25 that's still multiple errors per sentence! The threshold for truly "error-free" is quite high...
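For what it's worth, the two figures in that quote are roughly consistent with each other. Here is a minimal back-of-the-envelope sketch, assuming letter errors are independent and every letter has the same 96% chance of being right (a simplification; real OCR errors tend to cluster). The function names and the 80-letter sentence length are my own illustrative choices, not anything from the article.

```python
import math


def word_error_rate(letter_accuracy: float, word_length: int) -> float:
    """Chance that a word contains at least one wrong letter,
    assuming each letter is recognized independently."""
    return 1.0 - letter_accuracy ** word_length


def expected_letter_errors(letter_accuracy: float, n_letters: int) -> float:
    """Expected number of wrong letters in a run of n_letters."""
    return (1.0 - letter_accuracy) * n_letters


def word_length_for_error_rate(letter_accuracy: float, target: float) -> float:
    """Word length at which the word error rate reaches `target`
    under the same independence assumption."""
    return math.log(1.0 - target) / math.log(letter_accuracy)


if __name__ == "__main__":
    # A 5-letter word goes wrong a bit under 1 time in 5,
    # a 10-letter word about 1 time in 3.
    for n in (5, 10):
        print(n, round(word_error_rate(0.96, n), 3))

    # Hypothetical 80-letter sentence: a little over 3 wrong letters on average.
    print(round(expected_letter_errors(0.96, 80), 1))

    # Word length that would give "one-third of the words contained a typo".
    print(round(word_length_for_error_rate(0.96, 1 / 3), 1))
```

Under those assumptions, a one-third word error rate corresponds to words of about ten letters, which is longer than typical words, so the errors in the real transcriptions are probably not spread evenly across letters.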
Sloppy summary: some researcher has trained some NN or whatever to segment and then OCR old handwritten text, and hopes to use it on the enormous archive the Vatican has. Apparently this is because if it's not scanned, it's almost completely useless to "modern scholars", which I take to mean those historians who only read medieval Latin if it's printed on a screen...