In case anyone wonders: I tested whether Google could solve its own captchas. It can if the characters are separated, but once they overlap, as they usually do, it doesn't work.
I find it tremendously frustrating that so many people are creating this problem for themselves.<p>Anything that needs to be data should be data, not images. Except for some very specific cases, you're not doing anybody any favors by outputting PDF. That format is a data black hole. It allows you to transmit very well-formatted output, but it absolutely <i>stops</i> you from reliably <i>using</i> anything in that content.<p>I beg you all: if it's anything that contains data, or really, if it's anything for which layout and formatting are not absolutely critical, please don't use PDF. Send data as data.
Has anyone checked to see if this works with Japanese, Korean, or Chinese? What about Arabic or Hindi? This would shed some light on whether it's likely to be Tesseract or OCRopus.
Wow, I just tested with an image, and you get a GDoc with the image on top and the OCRed text at the bottom.<p>Pretty cool.<p>I wonder what they are using for Google Goggles and this.
Incidentally, I noticed that if you try to use Tesseract on an image taken from a Google Books page, you get terrible OCR accuracy. Does anyone know why that is?
Trying to improve some scanned forms I have, I got an average of 5 characters recognized per page. The form formatting was also recognized as "1 1 1 1 1 1 1 1 1 1 1 1 1".<p>I may not rely entirely on Google Docs for my OCR needs in future ;)