Show HN: Tesseract.js – Pure JavaScript OCR for 60 Languages

727 points by bijection over 8 years ago

28 comments

xigency over 8 years ago

To anyone screen-capturing small fonts as a demonstration, or capturing digital text at a small resolution: I don't believe that is the purpose of this OCR library. (As a specialized problem, that might be easier to solve, depending on the typeface.)
A much better example that works quite well is a picture of someone holding a book: http://i.imgur.com/3JWs64x.jpg
<pre><code>
Magic . Read this to yourself. Read it silently Don't move your lips. Don’t make a suund Listen to yourself. Listen without hearing What a wonderfully weird thing, huh? NOW MAKE THIS PART LOUD! SCREAM IT IN YOUR MIND! DROWN EVERYTHING OUT. Now, hear a whisper. A tiny whisper. New, read this next line with your best crotchety— old-man voice: “Hello there, sonny. Does your town have apost 0 Awesome! Who was that? Whose voice was that? It sure wasn’t yours! How do you do that? How?! Must be magic.
</code></pre>
Problems with this text: it misspelled 'sound' as 'suund', didn't recognize the word 'anything', and misrecognized 'a post office' as 'apost 0'.
Not bad, especially since two of the three mistakes are on the edge of the page.

pyronite over 8 years ago

The text detection is lacking in comparison to Google's Vision API. Here is a real-life comparison between Tesseract and Google's Vision API, based on a PDF a user of our website uploaded.
Original text [http://i.imgur.com/CZGhKhn.png]:
> I am also a top professional on Thumbtack which is a site for people looking for professional services like on gig salad. Please see my reviews from my clients there as well
Google detects [http://i.imgur.com/pSJym1x.png]:
> “ I am also a top professional on Thumbtack which is a site for people looking for professional services like on gig salad. Please see my reviews from my clients there as well ”
Tesseract detects [http://i.imgur.com/wwbLU6g.png]:
> \ am also a mp pmfesslonzl on Thummack wmcn Is a sue 1m peop‘e \ookmg (or professmna‘ semces We on glg salad P‘ezse see my rewews 1mm my cuems were as weH
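One way to put a number on a comparison like this is a character error rate: the edit distance between the OCR output and the reference text, divided by the reference length. The sketch below is illustrative only. `levenshtein` and `charErrorRate` are hypothetical helper names, not part of Tesseract.js or the Vision API.

```javascript
// Hypothetical helpers for scoring OCR output against a known reference.
// Classic dynamic-programming Levenshtein distance, keeping one row at a time.
function levenshtein(a, b) {
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Edit distance normalised by reference length; 0 means a perfect match.
function charErrorRate(reference, ocrOutput) {
  if (reference.length === 0) return ocrOutput.length === 0 ? 0 : 1;
  return levenshtein(reference, ocrOutput) / reference.length;
}
```

Passing the original sentence and each engine's output to `charErrorRate` would give one comparable number per engine.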

iplaw over 8 years ago

HOW is there not a better, almost 100% accurate OCR tool?
I routinely (daily) need to OCR PDF files. The PDF files are not scans. They are PDF files created from a Word file. The text is 100% clear, the lines are 100% straight, and the type is 100% uniform.
And yet, Microsoft and Google OCR spit out gibberish that is full of critical errors.
From a problem-solving perspective, this seems like an incredibly easy problem to solve in this exact use case, that is, PDFs generated from text files. Identify a uniform font size (to prevent o-to-O and o-to-0 errors), identify a font family (serif, sans-serif, then narrow to particular fonts), and OCR the damn thing. And yet, the output is useless in my field.

jameslk over 8 years ago

For all those reporting issues with reading text from a screenshot of this page, note that this is more an issue with the original Tesseract library than with this library (which appears to wrap Tesseract compiled through Emscripten). I remember having a similar issue when I used the original Tesseract. The quick hack I found to fix it was to upscale any small-text input image 3x before feeding it to Tesseract. I'm sure there are more intelligent solutions to mitigate that problem.
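The 3x rescaling hack can be sketched as a pure function over raw pixel data. This is a minimal illustration, assuming an RGBA buffer laid out like `ImageData.data`. `upscaleRGBA` is a hypothetical helper name, and nearest-neighbour sampling is chosen only for brevity.

```javascript
// Hypothetical helper: enlarge an RGBA pixel buffer by an integer factor
// with nearest-neighbour sampling. `data` uses the same layout as
// ImageData.data (4 bytes per pixel, row-major).
function upscaleRGBA(data, width, height, factor) {
  const outWidth = width * factor;
  const outHeight = height * factor;
  const out = new Uint8ClampedArray(outWidth * outHeight * 4);
  for (let y = 0; y < outHeight; y++) {
    const srcY = Math.floor(y / factor);
    for (let x = 0; x < outWidth; x++) {
      const srcX = Math.floor(x / factor);
      const src = (srcY * width + srcX) * 4;
      const dst = (y * outWidth + x) * 4;
      // Copy the R, G, B and A bytes of the source pixel.
      for (let c = 0; c < 4; c++) out[dst + c] = data[src + c];
    }
  }
  return { data: out, width: outWidth, height: outHeight };
}
```

In a browser, drawing the image onto a canvas three times as large would achieve a similar enlargement with smoothing, which may suit OCR better than blocky nearest-neighbour output.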

AgentME over 8 years ago

Why the promise-like interface? If it returned a promise with a this-returning progress method monkey-patched onto it, then you could use it otherwise like a regular promise:
<pre><code>
Tesseract.recognize(myImage)
  .progress(function(message){console.log(message)})
  .then(function(result){console.log(result)})
  .catch(function(err){console.error(err)});
</code></pre>
or
<pre><code>
Tesseract.recognize(myImage)
  .progress(function(message){console.log(message)})
  .then(
    function(result){console.log(result)},
    function(err){console.error(err)}
  );
</code></pre>
I guess I just still have bad memories of jQuery's old almost-like-real promises. I'd rather never have to think ever again about whether I'm dealing with a real promise or one that's going to surprise me and break at run-time because I tried to use it like a real one.
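The suggestion above can be sketched from the implementing side. This is only an illustration of the pattern. `runWithProgress` and its `work` callback are hypothetical names and are not the Tesseract.js API.

```javascript
// Hypothetical stand-in for a long-running job; NOT the Tesseract.js API.
// `work` receives a `report` callback and returns the final result.
function runWithProgress(work) {
  const listeners = [];
  const promise = new Promise(function (resolve, reject) {
    // Defer the work so handlers attached right after the call are registered first.
    setTimeout(function () {
      try {
        resolve(work(function report(message) {
          listeners.forEach(function (listener) { listener(message); });
        }));
      } catch (err) {
        reject(err);
      }
    }, 0);
  });
  // Monkey-patch a chainable, this-returning progress method onto the real promise.
  promise.progress = function (listener) {
    listeners.push(listener);
    return promise;
  };
  return promise;
}
```

Because the returned object is a genuine `Promise`, `.then` and `.catch` behave as usual. One caveat of this design is that `.then` returns a fresh promise without the patched method, so `.progress` has to come first in the chain.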

greenpizza13 over 8 years ago

Excited about this... but the OCR quality seems to be very bad. Maybe it's not optimized for recognizing black text on a white background.
For example, I took a screenshot of this comment and ran it through the demo and got this:
> Excited ehent this... but the OCR enenty Seems te be very bad. Maybe it's het Dptimized far recngnizing black text an e white heckgmnhe. EDI example, 1 tank e Screenshnt at this cement ehe teh it. thmneh the den» ehd get this:
It seems to recognize the bounding boxes just fine but mangles the words.

goatslacker over 8 years ago

I've been using this library to read screenshots of Pokemon Go to automatically calculate Individual Values for each Pokemon [1]. It's worked great on desktop, but on mobile Safari, where it matters most, the library causes the browser to crash :(
[1]: https://github.com/goatslacker/pokemon-go-iv-calculator/blob/master/web/components/PictureUpload.js

userbinator over 8 years ago

Tesseract was one of the best publicly available CAPTCHA solvers when I was playing around with that stuff a few years ago; I remember somewhere in the neighbourhood of 90%+ accuracy on ReCAPTCHA. No wonder they've changed those considerably since then, to the point of making them difficult even for humans.

gentleteblor over 8 years ago

I've always wanted to use Tesseract on .NET projects but it was always clumsy (wrappers). This looks like it'll make things easier.
Thanks for putting this out.

yankyou over 8 years ago

> Drop an English image on this page to OCR it!
This looks great, and I'd really love to, but
> Uncaught ReferenceError: progress is not defined
EDIT: works now!

mdani over 8 years ago

Languages list link is broken - getting 404 for the following: https://github.com/naptha/tesseract.js/blob/master/tesseract_lang_list.md

zelon88 over 8 years ago

Does this mean I can implement Tesseract on my home server without using PHP's shell_exec to perform magic on my files? I can just use JavaScript instead? Cool!
My current HRCloud2 project could benefit greatly if I ever get around to it. Currently I make the PHP interpreter jump through hoops and move stuff all over the place to OCR images and docs. This could save a ton of time and shift the processing to the client instead of my server.

KiwiCoder over 8 years ago

Impressive that this is pure JS; however, trying an image cut from the page itself gave this result:
> Dropan Enghsh Wage on (Ms page to OCR m
Should be:
> Drop an English image on this page to OCR it!

daliwali over 8 years ago

The title and description are very misleading: this is technically "pure JavaScript" but the JS is compiled from the original C++ library of the same name using emscripten. I think "pure JS" would imply that all of its sources are written in JS which is not the case here. It's mostly the C++ code doing the actual work, with a little JS wrapper on top.

slajax over 8 years ago

Pretty cool. I screen-captured the text in the bottom right corner of the page and it had some issues. Here's a screenshot of what happened: http://io.kc.io/hkeM

mgalka over 8 years ago

Awesome! The ability to OCR video in a browser opens up so many interesting possibilities.

jaytaylor over 8 years ago

For those who may be interested: I threw together a quick proof-of-concept in Go for exposing tesseract via a web API: https://github.com/jaytaylor/tesseract-web

zhte415 over 8 years ago

Does this include taking a text and, for example, when viewing it, 'wiping' the text in the logical native language order?
For languages that don't employ much whitespace, this would be nice.

artf over 8 years ago

Sorry guys, probably a stupid question (I googled quickly, but it didn't work): does this kind of stuff involve ML? Do I need to train it?

maaaats over 8 years ago

Does it block while it works and do the work in several setTimeouts or how do they get it to report progress without freezing everything?
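The thread does not say how the library actually handles this. As an illustration of the setTimeout approach the question describes (an assumption, not a description of Tesseract.js internals), work can be split into small batches that yield to the event loop between batches. `processInChunks` and its parameters are hypothetical names.

```javascript
// Hypothetical illustration: handle items in small batches and yield to the
// event loop between batches so the page can repaint and report progress.
function processInChunks(items, handleItem, onProgress, chunkSize) {
  return new Promise(function (resolve) {
    let index = 0;
    function runChunk() {
      const end = Math.min(index + chunkSize, items.length);
      for (; index < end; index++) handleItem(items[index]);
      // Report the fraction completed so far (1 for an empty input).
      onProgress(items.length === 0 ? 1 : index / items.length);
      if (index < items.length) {
        setTimeout(runChunk, 0); // let other events run before the next batch
      } else {
        resolve();
      }
    }
    runChunk();
  });
}
```

Running the heavy work in a Web Worker and posting progress messages back to the page is another common way to keep the UI responsive.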

codemode over 8 years ago

Is it true that the original implementation of Tesseract, executed from the command line, is faster than the JavaScript-translated version?

ckluis over 8 years ago

What License? Doesn't mention it.

mrcactu5 over 8 years ago

Tesseract is not specific to JavaScript right? I do recall there being a version for Python

z3t4 over 8 years ago

More instructions, like how to train it, would be nice.

niutech over 8 years ago

How does it compare with Ocrad.js?

newtons_bodkin over 8 years ago

How long did this take to build?

sanketbajoria over 8 years ago

Awesome

employee8000 over 8 years ago

Is this at all affiliated with the already-existing tesseract OCR library? It doesn't seem to be from my cursory check, so if not, you need to rename your library, because you're ripping off their name. https://github.com/tesseract-ocr/tesseract
