Ask HN: OCR Libraries for Receipt Scanning/Parsing?

69 points by selbyk about 4 years ago

I'm interested in keeping tabs on my spending and comparing prices of items I buy at grocery stores, because I tend not to think about it when I need something. I am conscious of the extreme price discrepancies for the exact same items at stores just blocks apart here in NYC, but it's difficult to keep track of the prices of each item at various places to optimize shopping.

I want to build a system that can keep a running tab of my purchases by item, price, and store. I need to find a library that can effectively scan a receipt, recognize the store (usually name, number, address and logo at the top), and differentiate each item label and its price. I plan to manually tag each item label from a store's receipt with the item's barcode the first time it is seen.

I have been sporadically googling for the past 6 months but am still unsure which OCR library(s) I should invest my time in, or how low-level I should start. Should I grab a library like Tesseract and do my own feature extraction, or use libs that spit out semi-structured objects with text and hope they return something similar enough across store receipts to make sense of consistently?

I'm ok with this being an extended project, but I would like some input on choosing a solid library with accurate OCR, and advice on how to approach training/parsing from someone with more experience.

Other solutions and advice are also welcome.
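
For what it's worth, the bookkeeping side described above (purchases by item, price, and store, with a one-time manual barcode tag per store label) can be sketched independently of whichever OCR library ends up being used. This is only an illustration: `Purchase`, `PriceBook`, and all field names are made up here, and prices are assumed to be stored as integer cents.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Purchase:
    store: str         # store name as recognized from the receipt header
    label: str         # item label exactly as printed on the receipt
    price_cents: int   # price paid, in cents (assumed representation)


class PriceBook:
    """Hypothetical container: maps per-store receipt labels to barcodes."""

    def __init__(self):
        self.barcodes = {}                 # (store, label) -> barcode
        self.history = defaultdict(list)   # barcode -> [(store, price_cents), ...]
        self.untagged = []                 # purchases waiting for a manual tag

    def tag(self, store, label, barcode):
        """Record a manual label -> barcode tag, then file any waiting purchases."""
        self.barcodes[(store, label)] = barcode
        waiting, self.untagged = self.untagged, []
        for purchase in waiting:
            self.add(purchase)

    def add(self, purchase):
        """File a purchase under its barcode, or park it until it is tagged."""
        barcode = self.barcodes.get((purchase.store, purchase.label))
        if barcode is None:
            self.untagged.append(purchase)
        else:
            self.history[barcode].append((purchase.store, purchase.price_cents))

    def cheapest(self, barcode):
        """Return (store, price_cents) for the lowest price seen, or None."""
        seen = self.history.get(barcode)
        return min(seen, key=lambda entry: entry[1]) if seen else None
```

Keying the tags on `(store, label)` rather than on the label alone reflects the point that the same product is printed differently on each store's receipt.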

24 comments

ampdepolymerase about 4 years ago

If you need 99%+ accuracy, go for AWS Mechanical Turk. They are used by Wave Accounting and other office application companies for receipt OCR. For 85-95%+ accuracy, any off-the-shelf solution like Google Cloud ML APIs or AWS Textract will be fine. You can get better results with both the cloud APIs and hand-rolled ML models if you have a good dataset. For this sort of application, a large quantity of well-annotated data is king. If you only have <100 receipts per year and need very high accuracy, it might be cheaper to just go with AWS Mechanical Turk end-to-end. You have to pay people to annotate the data anyway if you want to train a model, so it might be easier to just stick with humans.

sandreas about 4 years ago

Maybe this is helpful: https://nanonets.com/blog/receipt-ocr/

In my opinion, Tesseract is the most sophisticated "free" OCR solution out there. The problem with Tesseract is not its recognition capabilities but the preprocessing steps:

 - thresholding
 - deskewing
 - segmentation
 - ...

There is a (non-free) C# library that improves recognition A LOT just by providing these abilities: https://www.vintasoft.com/vsocr-dotnet-index.html

If you find a good open source solution, I would be interested, too...
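
To make the thresholding step concrete, here is a minimal pure-Python sketch of Otsu's method, a common global thresholding technique. It is picked here only for illustration and is not a description of what Tesseract or the VintaSoft library does. The function names are made up, and the input is assumed to be 8-bit grayscale values (0-255).

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    n_dark, sum_dark = 0, 0          # running count and sum of pixels <= t
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0 or n_light == 0:
            continue                 # one class is empty, nothing to separate
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        score = n_dark * n_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(rows, t):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [[255 if p > t else 0 for p in row] for row in rows]
```

A single global threshold like this tends to struggle with uneven lighting on phone photos, which is one reason adaptive (per-region) thresholding is often preferred for receipts.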

ivan_ah about 4 years ago

Here are some links and POC code on this problem:

slides: http://slides.com/rolisz/receiptbudget#/1

code: https://github.com/rolisz/receipt_budget

research article: https://www.authorea.com/users/6050/articles/6335-a-novel-machine-learning-based-approach-for-retrieving-information-from-receipt-images/_show_article

jka about 4 years ago

For "middle ground" projects like this (criteria: a common enough problem that lots of people _should_ have thought about it -- but it may not be a lucrative core business area -- and there aren't any household-name open source projects that cover it), I often turn to GitHub repository search to see what's available.

Based on that, your best bet might be https://github.com/ReceiptManager/receipt-parser-legacy, which is a Python library built on top of the Tesseract OCR engine. You can use it containerized, in Android/iOS applications, or via your own Python scripts.

rolisz about 4 years ago

I worked on such a project 8 years ago. I actually ended up building my own OCR engine, after manually annotating about 50 receipts (about 8000 characters, if I remember correctly). One of the problems I encountered back then is that snapping a picture of a receipt with your phone results in weird lighting conditions and angles, which mess with the OCR engine. The second problem is that it's hard to keep the receipt straight while taking the picture, so it is hard to identify lines in the picture, because they will be curved.

To some extent, all this is solved by some modern APIs, such as what GCP or AWS offer, which do the OCR for you. But as far as I know, there is still one more challenge: interpreting the text. Inferring what each line is, and which price belongs to which item (some receipts have the price on the same line, some on the next line, some above), is quite hard. I tried to do it with rules (regexes and lots of ifs), but even with 95% accuracy from the OCR engine, errors will trip you up.

You can probably frame this as an ML problem as well, but I don't think you'll find any datasets for this.
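
As an illustration of the "regexes and lots of ifs" approach, here is a small sketch that handles just two of the layouts mentioned: price on the same line as the item, and price on the following line. It is not the code from the project above. The pattern, the `SUMMARY_WORDS` set, and the sample receipt lines are all assumptions, and the "price above" layout is deliberately left out.

```python
import re

# A price at the end of a line, e.g. "3.49" or "$12.00" (assumed format).
PRICE_AT_END = re.compile(r"^(?P<label>.*?)\s*\$?(?P<price>\d+\.\d{2})\s*$")

# Lines starting with these words are treated as summary rows, not items.
SUMMARY_WORDS = {"TOTAL", "SUBTOTAL", "TAX"}


def parse_items(lines):
    """Pair item labels with prices found on the same line or the next line."""
    items = []
    pending_label = None              # label seen without a price so far
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = PRICE_AT_END.match(line)
        if match is None:
            pending_label = line      # no price here; it may be on the next line
            continue
        label = match.group("label").strip()
        if not label:                 # price-only line: use the pending label
            label = pending_label
        pending_label = None
        if label and label.split()[0].upper() not in SUMMARY_WORDS:
            items.append((label, match.group("price")))
    return items


receipt = [
    "ORG BANANAS      1.29",
    "WHOLE MILK 1GAL",
    "   $4.59",
    "",
    "TOTAL 5.88",
]
print(parse_items(receipt))
```

A single misread character (say `1.Z9` instead of `1.29`) makes the line fall through to the "label without a price" branch and can mispair the next price, which is the kind of failure described above.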

jonahbenton about 4 years ago

As others have suggested, this is not a project where stitching together OSS OCR bits is going to yield anywhere near useful results. At multiple levels of the stack, the error bars on the tech bits are really wide, and narrowing them is still a research project. This is why most of the suggestions say that, if you want a workable solution, brute-force cheap human labor via Mechanical Turk is the only option.

However, if you are looking for a project, picking one grocery store with one receipt format and generally limited/consistent product coding schemes is a reasonable thing to plug away on. Speaking personally, I did this with Whole Foods receipts for a while and was able to get to almost, kinda usable. But then the pandemic hit and I started ordering delivery, which obviates the whole receipt ingestion thing because I can get all those details directly from Amazon (modulo doing some data scraping).

Analytics on food purchases are a tremendously interesting and deeply underexplored space in which there is lots of future commercial potential.

wcarss about 4 years ago

I've had friends work at Sensibill[1], which sells tools (mostly to banks) to build some of what you're imagining right into banking and expense-tracking apps. Not sure if they have anything à la carte, but they might have something of value to look at.

[1] https://getsensibill.com/

xnx about 4 years ago

If you're set on building your own, you're probably not interested in using this: https://blog.google/technology/area-120/stack , but it might be a useful reference.

mkl about 4 years ago

I've been experimenting with using Tesseract to get information out of scanned tutorial roll sheets, with surprising success. If you ask it for TSV or hOCR output, it will give you a bounding box for each word. To extract a student's attendance information, I grep the TSV files for a student ID number or name, get the y position with sed, and combine slices of the page images with ImageMagick (in my case I want to see all the handwritten ticks and numbers). You might be able to do something similar, looking for numbers on the same line as key words like "Total" or "apples" or whatever. Some of your success will depend on how well you scan the receipts.
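
A rough Python take on the same idea, as a sketch rather than the grep/sed pipeline described above: group the word rows of Tesseract's TSV output (as produced by e.g. `tesseract receipt.png out tsv`) into lines, then look for a price-like word on the line that contains a keyword. The helper names, the price pattern, and the sample rows are made up for illustration.

```python
import csv
import io
import re
from collections import defaultdict

PRICE = re.compile(r"^\$?\d+\.\d{2}$")   # assumed price format, e.g. 2.49

HEADER = ["level", "page_num", "block_num", "par_num", "line_num",
          "word_num", "left", "top", "width", "height", "conf", "text"]


def lines_from_tsv(tsv_text):
    """Group word rows into text lines, each ordered left to right."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    lines = defaultdict(list)
    for row in reader:
        text = (row["text"] or "").strip()
        if row["level"] != "5" or not text:   # level 5 rows are single words
            continue
        key = (int(row["page_num"]), int(row["block_num"]),
               int(row["par_num"]), int(row["line_num"]))
        lines[key].append((int(row["left"]), text))
    return [[word for _, word in sorted(words)]
            for _, words in sorted(lines.items())]


def price_for(keyword, tsv_text):
    """Return the right-most price-like word on the first line with keyword."""
    for words in lines_from_tsv(tsv_text):
        if any(keyword.lower() in word.lower() for word in words):
            prices = [word for word in words if PRICE.match(word)]
            if prices:
                return prices[-1]
    return None


# Made-up sample rows in the TSV column layout, for illustration only.
SAMPLE = "\n".join("\t".join(fields) for fields in [
    HEADER,
    ["5", "1", "1", "1", "1", "1", "10", "20", "60", "12", "96", "APPLES"],
    ["5", "1", "1", "1", "1", "2", "200", "20", "40", "12", "93", "2.49"],
    ["5", "1", "1", "1", "2", "1", "10", "40", "50", "12", "95", "Total"],
    ["5", "1", "1", "1", "2", "2", "200", "40", "40", "12", "91", "7.80"],
])
print(price_for("total", SAMPLE))
```

This relies on Tesseract's own line grouping (`line_num`). On skewed or curved receipts, comparing the `top` coordinates with a tolerance may work better than trusting the line numbers.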

ephbit about 4 years ago

Makes me think of this idea I had: receipt printer developers should just add a feature to print a QR code that contains all the relevant information on the receipt in CSV format. Customers could choose whether they want one, and be charged a small fee, or whether they're fine without.

Unless there's some pressure through government regulation to implement this, it won't happen though ... because who's least interested in customers comparing prices and having transparency in their spending? The retailers, obviously.

verelo about 4 years ago

I was the tech founder at a company that built this exact technology: checkout51.com (still running, but we sold it and I've since moved on).

If you want to chat, feel free to reach out; I could talk all day about this stuff.

pjc50 about 4 years ago

I had this idea a while ago, tried a number of libraries including Tesseract, and found all the results extremely poor. I'd be interested to see if one that works is suggested.

phenkdo about 4 years ago

I found easyocr to be the most accurate. Tesseract was meh.

screye about 4 years ago

Microsoft's Form Recognizer is pretty good. (https://docs.microsoft.com/en-us/azure/cognitive-services/form-recognizer/overview?tabs=v2-1)

discl@imer - I verk 4 not-Macro-Hard. But, I have no connection to this team.

edit: this might be terribly extra for personal use.

eastendguy about 4 years ago

For free OCR and quick prototyping, I use https://ocr.space/receiptscanning - It is easy to use and has a generous free tier of 25,000 free scans each month.

Having said that, I am sure there must be some existing accounting software with built-in OCR? Probably even an app?

wyiske about 4 years ago

I built an app to scan receipts for bill splitting, although your use case is certainly interesting.

Google's MLKit is very accurate for on-device recognition. You can even feed frames straight from the camera with almost real-time results. Your bigger problem will be parsing the results and handling very inconsistent receipts.

villgax about 4 years ago

If you have the time, go for MLKit or any other OCR API (Tesseract is pathetic for non-scanned, in-the-wild images), then put your parsers on top of the OCR output.

If time is of the essence, simply use AWS Textract and be done with its free tier.

scandox about 4 years ago

Well, you should evaluate ABBYY to see how well it performs, as it is one of the most widely used commercial applications for OCR.

I used it for years to scan our bank statements (before our bank could export data).

It was the only thing I ever found that handled tabular data properly.

gitowiec about 4 years ago

I'm familiar with Camelot, which is used by a UI called Excalibur. It is intended more for scanning invoices or bank statements. It is perfect for tabular data and can handle tables without explicit column edges.

dudus about 4 years ago

Google launched an Android app called Stack. It's out of their Area 120, so it's not a fully supported product. But it scans and uploads to Google Drive and does some OCR. It's been pleasant to use.

trbznk about 4 years ago

We have a similar project and tried AWS Textract and the Google Cloud Vision API. For us, it seems that Google's OCR gives more accurate results. Pricing is nearly the same.

perssontm about 4 years ago

I recently started using paperless-ng; check it out, perhaps you can build on that. It includes Tesseract for OCR, for example.

misiti3780 about 4 years ago

Use Textract. Super easy to integrate and results are pretty impressive. Also, it is cheap.

MattGaiser about 4 years ago

Why not just use Mechanical Turk? You can get receipts done for pennies.
