3 comments

I have the Kindle version of The Seleucid Royal Economy, which for obvious reasons includes Greek text.

It's been OCRed, and the Greek has been mangled beyond belief. Sometimes the OCR will split a single character.

No real point to the story, but it feels relevant here. I see Rescribe has already encountered the problem: "In the second step we run the OCR on the preprocessed files, using our specifically trained packages and adapting language and character settings to the document at hand."

(I'm only complaining to a very small degree. Having a low-quality OCRed ebook available is much better than having no ebook available. And what is normally displayed is the image of the text, not the OCRed nonsense, so it doesn't matter that the Greek has been transformed into gibberish until you encounter the odd mid-character word break.)

raybb, over 3 years ago

I think the folks at OpenLibrary.org would benefit from something like this.

IshKebab, over 3 years ago

Is Tesseract any good yet? Last I heard they were experimenting with deep-learning-based recognition, but when I tried it before that, it didn't work at all. Kind of Pocketsphinx levels of rubbish.

Rescribe: A high quality OCR tool for historic books
