For anyone testing this with screen captures of small fonts, or of digital text at a low resolution: I don't believe that is the purpose of this OCR library. (As a specialized problem, that might be easier to solve depending on the typeface.)<p>A much better example that works quite well is a picture of someone holding a book: <a href="http://i.imgur.com/3JWs64x.jpg" rel="nofollow">http://i.imgur.com/3JWs64x.jpg</a><p><pre><code> Magic .
Read this to yourself. Read it silently
Don't move your lips. Don’t make a suund
Listen to yourself. Listen without hearing
What a wonderfully weird thing, huh?
NOW MAKE THIS PART LOUD!
SCREAM IT IN YOUR MIND!
DROWN EVERYTHING OUT.
Now, hear a whisper. A tiny whisper.
New, read this next line with your best crotchety—
old-man voice:
“Hello there, sonny. Does your town have apost 0
Awesome! Who was that? Whose voice was that?
It sure wasn’t yours!
How do you do that?
How?!
Must be magic.
</code></pre>
Problems with this text: misspelled 'sound' as 'suund', didn't recognize the word 'anything', and mis-recognized 'a post office' as 'apost 0'.<p>Not bad. Especially since two of three mistakes are on the edge of the page.
The text detection is lacking in comparison to Google's Vision API. Here is a real-life comparison between Tesseract and Google's Vision API, based on a PDF a user of our website uploaded.<p>Original text [<a href="http://i.imgur.com/CZGhKhn.png" rel="nofollow">http://i.imgur.com/CZGhKhn.png</a>]:<p>> I am also a top professional on Thumbtack which is a site for people looking for professional services like on gig salad. Please see my reviews from my clients there as well<p>Google detects [<a href="http://i.imgur.com/pSJym1x.png" rel="nofollow">http://i.imgur.com/pSJym1x.png</a>]:<p>> “ I am also a top professional on Thumbtack which is a site for people looking for professional services like on gig salad. Please see my reviews from my clients there as well ”<p>Tesseract detects [<a href="http://i.imgur.com/wwbLU6g.png" rel="nofollow">http://i.imgur.com/wwbLU6g.png</a>]:<p>> \ am also a mp pmfesslonzl on Thummack wmcn Is a sue 1m peop‘e \ookmg (or professmna‘
semces We on glg salad P‘ezse see my rewews 1mm my cuems were as weH
HOW is there not a better, almost 100% accurate OCR tool?<p>I routinely (daily) need to OCR PDF files. The PDF files are not scans. They are PDF files created from a Word file. The text is 100% clear, the lines are 100% straight, and the type is 100% uniform.<p>And yet, Microsoft and Google OCR spit out gibberish that is full of critical errors.<p>From a problem-solving perspective, this seems like an incredibly easy problem to solve in this exact use case, that is, PDFs generated from text files. Identify a uniform font size (to prevent o-to-O and o-to-0 errors), identify a font family (serif, sans-serif, then narrow to particular fonts), and OCR the damn thing. And yet, the output is useless in my field.
For all those reporting issues with reading text from a screenshot of this page, note that this is more an issue with the original Tesseract library than with this library (which appears to wrap Tesseract compiled through Emscripten). I remember having a similar issue when I used the original Tesseract. The quick hack I found to fix it was to rescale any small text input images 3x before feeding them to Tesseract. I'm sure there are more intelligent solutions to mitigate that problem.
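A rough sketch of the 3x rescaling hack described above, in case it helps anyone. It assumes the image is already available as raw RGBA pixels in an ImageData-like object (`{ data, width, height }`); the function name `upscaleNearest` is made up for illustration and is not part of Tesseract or this library. It uses plain nearest-neighbour scaling, which is only one of several possible resampling methods.

```javascript
// Hypothetical helper: nearest-neighbour upscaling of an RGBA pixel buffer
// by an integer factor. `image` mimics the shape of a canvas ImageData
// object: { data: Uint8ClampedArray, width, height }.
function upscaleNearest(image, factor) {
  if (!Number.isInteger(factor) || factor < 1) {
    throw new RangeError("factor must be a positive integer");
  }
  const { data, width, height } = image;
  const outWidth = width * factor;
  const outHeight = height * factor;
  const out = new Uint8ClampedArray(outWidth * outHeight * 4);

  for (let y = 0; y < outHeight; y++) {
    // Each output row maps back to one source row.
    const srcY = Math.floor(y / factor);
    for (let x = 0; x < outWidth; x++) {
      const srcX = Math.floor(x / factor);
      const src = (srcY * width + srcX) * 4;
      const dst = (y * outWidth + x) * 4;
      // Copy the four channels (R, G, B, A) of the source pixel.
      out[dst] = data[src];
      out[dst + 1] = data[src + 1];
      out[dst + 2] = data[src + 2];
      out[dst + 3] = data[src + 3];
    }
  }
  return { data: out, width: outWidth, height: outHeight };
}

// Example: a 2x1 image (one red pixel, one blue pixel) scaled 3x.
const tiny = {
  data: new Uint8ClampedArray([255, 0, 0, 255, 0, 0, 255, 255]),
  width: 2,
  height: 1,
};
const bigger = upscaleNearest(tiny, 3);
console.log(bigger.width, bigger.height);
```

In a browser, drawing the image onto a canvas three times as large would probably be the more natural route; the loop above just makes the idea explicit. Whether the result can be handed to the OCR call directly depends on which input types the library accepts.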
Why the promise-<i>like</i> interface? If it returned a promise with a this-returning progress method monkey-patched onto it, then you could use it otherwise like a regular promise:<p><pre><code> Tesseract.recognize(myImage)
.progress(function(message){console.log(message)})
.then(function(result){console.log(result)})
.catch(function(err){console.error(err)});
</code></pre>
or<p><pre><code> Tesseract.recognize(myImage)
.progress(function(message){console.log(message)})
.then(
function(result){console.log(result)},
function(err){console.error(err)}
);
</code></pre>
I guess I just still have bad memories of jQuery's old almost-like-real promises. I'd rather never have to think ever again about whether I'm dealing with a real promise or one that's going to surprise me and break at run-time because I tried to use it like a real one.
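To make the suggestion concrete, here is a minimal sketch of a real promise with a this-returning `progress` method patched onto it. The helper name `withProgress` and the `notify` callback are invented for this example and say nothing about how the library is actually implemented; the job passed in is a stand-in, not the real OCR call.

```javascript
// Hypothetical helper: runs `job` inside a genuine Promise and adds a
// chainable progress() method. `job` receives a `notify` callback and
// may return either a value or a promise.
function withProgress(job) {
  const listeners = [];
  const notify = (message) => listeners.forEach((fn) => fn(message));

  // Start the job on a later microtask so that listeners attached
  // synchronously after this call are registered before it runs.
  const promise = Promise.resolve().then(() => job(notify));

  // Monkey-patch progress() onto this one promise instance.
  promise.progress = function (fn) {
    listeners.push(fn);
    return this; // the same promise, so .then/.catch work as usual
  };
  return promise;
}

// Usage with a stand-in job.
withProgress((notify) => {
  notify({ status: "working", progress: 0.5 });
  return "recognized text";
})
  .progress((message) => console.log(message))
  .then((result) => console.log(result))
  .catch((err) => console.error(err));
```

One caveat with this approach: `.then()` and `.catch()` return fresh promises that do not carry the patched method, so `.progress()` has to come first in the chain, as in the examples above.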
Excited about this... but the OCR quality seems to be very bad. Maybe it's not optimized for recognizing black text on a white background.<p>For example, I took a screenshot of this comment and ran it through the demo and got this:<p>Excited ehent this... but the OCR enenty Seems te be very bad. Maybe it's het Dptimized far
recngnizing black text an e white heckgmnhe.
EDI example, 1 tank e Screenshnt at this cement ehe teh it. thmneh the den» ehd get this:<p>It seems to recognize the bounding boxes just fine but mangles the words.
I've been using this library to read screenshots of Pokemon Go to automatically calculate Individual Values for each Pokemon [1]. It's worked great on desktop, but on Mobile Safari, where it matters most, the library causes the browser to crash :(<p>1: <a href="https://github.com/goatslacker/pokemon-go-iv-calculator/blob/master/web/components/PictureUpload.js" rel="nofollow">https://github.com/goatslacker/pokemon-go-iv-calculator/blob...</a>
Tesseract was one of the best publicly available CAPTCHA solvers when I was playing around with that stuff a few years ago; I remember somewhere in the neighbourhood of 90%+ accuracy on ReCAPTCHA. No wonder they've changed those considerably since then, to the point of making them difficult even for humans.
I've always wanted to use Tesseract on .NET projects but it was always clumsy (wrappers). This looks like it'll make things easier.<p>Thanks for putting this out.
> Drop an English image on this page to OCR it!<p>This looks great, and I'd really love to but<p>> Uncaught ReferenceError: progress is not defined<p>EDIT: works now!
The languages list link is broken. I'm getting a 404 for the following:
<a href="https://github.com/naptha/tesseract.js/blob/master/tesseract_lang_list.md" rel="nofollow">https://github.com/naptha/tesseract.js/blob/master/tesseract...</a>
Does this mean I can implement Tesseract on my home server without using PHP's shell_exec to perform magic on my files? I can just use JavaScript instead? Cool!<p>My current HRCloud2 project could benefit greatly if I ever get around to it. Currently I make the PHP interpreter jump through hoops and move stuff all over the place to OCR images and docs. This could save a ton of time and shift the processing to the client instead of my server.
Impressive that this is pure JS. However, trying an image cut from the page itself gave this result:<p>> Dropan Enghsh Wage on (Ms page to OCR m<p>It should be:<p>> Drop an English image on this page to OCR it!
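For anyone who wants to put a number on comparisons like the one above, one common approach is a character error rate: the edit distance between the expected and recognized strings, divided by the length of the expected string. The sketch below is just one way to do that; the function names are made up for illustration, and it compares UTF-16 code units, which is fine for plain English text.

```javascript
// Levenshtein edit distance between two strings: the minimum number of
// single-character insertions, deletions, and substitutions needed to
// turn `a` into `b`. Only two rows of the table are kept in memory.
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 chars of a and first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // turning i chars of a into the empty string costs i deletions
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // delete a[i-1]
        curr[j - 1] + 1, // insert b[j-1]
        prev[j - 1] + cost // substitute, or keep if the chars match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: character error rate relative to the expected text.
function characterErrorRate(expected, actual) {
  if (expected.length === 0) {
    throw new RangeError("expected text must not be empty");
  }
  return levenshtein(expected, actual) / expected.length;
}

// Usage with the two strings quoted above.
const expectedText = "Drop an English image on this page to OCR it!";
const ocrText = "Dropan Enghsh Wage on (Ms page to OCR m";
console.log(characterErrorRate(expectedText, ocrText));
```

A rate of 0 means a perfect match; higher is worse, and the value can exceed 1 if the OCR output is much longer than the expected text.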
The title and description are very misleading: this is technically "pure JavaScript", but the JS is compiled from the original C++ library of the same name using Emscripten. I think "pure JS" would imply that all of its sources are written in JS, which is not the case here. It's mostly the C++ code doing the actual work, with a little JS wrapper on top.
Pretty cool. I screen captured the text in the bottom right corner of the page and it had some issues. Here's a screenshot of what happened: <a href="http://io.kc.io/hkeM" rel="nofollow">http://io.kc.io/hkeM</a>
For those who may be interested:<p>I threw together a quick proof-of-concept in Go for exposing Tesseract via a web API:<p><a href="https://github.com/jaytaylor/tesseract-web" rel="nofollow">https://github.com/jaytaylor/tesseract-web</a>
Does this include taking a text and for example, when viewing it, 'wiping' the text in the logical native language order?<p>For languages that don't employ much whitespace, this would be nice.
Is this at all affiliated with the already-existing Tesseract OCR library? It doesn't seem to be, from my cursory check, so if not, you need to rename your library, because you're ripping off their name.<p><a href="https://github.com/tesseract-ocr/tesseract" rel="nofollow">https://github.com/tesseract-ocr/tesseract</a>