Show HN: Extract Table from Image

217 points by v3gas over 3 years ago

17 comments

w-m over 3 years ago

I answer questions about Pandas (the Python data analysis framework) on StackOverflow from time to time. It's an exercise in patience, because many people post screenshots of their data instead of a reproducible code example. You have to point about every other newcomer to the documentation on how to write a proper question that one can actually answer.

I'd imagine other areas of StackOverflow (SQL, R?) are fighting similar issues. I've just tried it with a question (sure enough, the second-newest Pandas-tagged question had a table as an image), and your tool produced a nice .csv.

It would be a godsend to have a button on StackOverflow that would replace a user-uploaded image of a table with some Pandas code that constructs the same DataFrame. Currently I would have to download the image, upload it to extract-table.com, download the .csv, load it into Python, and run some code to create the code-based DataFrame.

I'd consider sending people on StackOverflow to your tool if you cut down some of the steps: (1) allowing users to paste in the URL of an image, and (2) producing Pandas code output that can be directly copy/pasted from the site (without having to download a csv).

For illustration, here's what the Pandas code would look like for the first example on extract-table.com:

<pre><code>import pandas as pd

df = pd.DataFrame(
    {'Name': {0: 'David', 1: 'Jessica', 2: 'Warren'},
     'Gender': {0: 'Male', 1: 'Female', 2: 'Male'},
     'Age': {0: 23, 1: 47, 2: 12}}
)</code></pre>
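The second request above (turning the extracted table straight into copy/pasteable Pandas code) could be sketched roughly as follows. This is only an illustration, not how extract-table.com works: the function name `csv_to_dataframe_code` is hypothetical, the input is assumed to be a rectangular CSV with a header row, and only integer cells are converted from strings. It uses just the standard library and emits the constructor as text.

```python
import csv
import io


def csv_to_dataframe_code(csv_text: str, var_name: str = "df") -> str:
    """Return Python source that rebuilds the CSV as a pandas DataFrame.

    Hypothetical helper: assumes a header row and rows of equal length.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, body = rows[0], rows[1:]

    def convert(cell: str):
        # Keep integers as ints so the generated code is not all strings.
        try:
            return int(cell)
        except ValueError:
            return cell

    # One list of values per column, in header order.
    columns = {
        name: [convert(row[i]) for row in body]
        for i, name in enumerate(header)
    }

    lines = [f"{var_name} = pd.DataFrame({{"]
    for name, values in columns.items():
        lines.append(f"    {name!r}: {values!r},")
    lines.append("})")
    return "\n".join(lines)


sample = "Name,Gender,Age\nDavid,Male,23\nJessica,Female,47\nWarren,Male,12\n"
print(csv_to_dataframe_code(sample))
```

The printed text can be pasted under an `import pandas as pd` line. Floats, dates, and missing values would need extra handling that this sketch leaves out.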

greaterweb over 3 years ago

Nice work putting together this tool. Have you seen either Spark OCR[1] from John Snow Labs or the Adobe PDF Extract API[2]? They both do a pretty good job of data extraction from tables as well.

[1] https://www.johnsnowlabs.com/spark-ocr/
[2] https://www.adobe.io/apis/documentcloud/dcsdk/pdf-extract.html

MattGaiser over 3 years ago

Pair this with a snipping tool and all sorts of people in banking would use it for a few hours a day, especially if it could paste to Excel or at least fill the clipboard in a way pastable to Excel.

I used to work for a bank on their innovation team and pitched basically this, but as an intern I had neither the skill nor the time to do it. But it was certainly something a bunch of people internally wanted.
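On "fill the clipboard in a way pastable to Excel": Excel splits pasted plain text into cells on tabs and newlines, so one low-effort route is to re-encode the tool's CSV as tab-separated text. The sketch below assumes that approach; the function name is hypothetical, the sample row is made up, and actually placing the text on the clipboard is platform-specific and left out.

```python
import csv
import io


def csv_to_excel_clipboard_text(csv_text: str) -> str:
    """Re-encode CSV text as tab-separated text, one table row per line."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    # Tabs separate cells; a plain "\n" ends each row.
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Made-up sample: the quoted cell contains a comma, which a naive
# str.replace(",", "\t") would wrongly split into two cells.
sample = 'Name,Gender,Age\n"Doe, Jane",Female,30\n'
print(csv_to_excel_clipboard_text(sample))
```

Going through the `csv` module rather than a string replace keeps quoted cells intact.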

nanis over 3 years ago

With this image[1] from this question on SO[2], the output[3] is missing the last row. FWIW, I've had the occasional miraculous-looking results from AWS Textract, but you do need to keep an eye on what's happening.

Update: I just checked a bit more carefully, and this example[4] is also missing the last row.

Also, Danish ø seems problematic on your web page, whereas the CSV has the right UTF-8 encoded bytes.

[1]: https://i.stack.imgur.com/y7Zrt.png
[2]: https://stackoverflow.com/q/69363708/100754
[3]: https://results.extract-table.com/8d4818867ad604792819e98808ca447d2e1d33b3f69817a475a2d05c7a932e8e
[4]: https://results.extract-table.com/254d95722a2c2b1df72fc26b59925ef94d5c91017a661a194d76f1a52e228634

eihli over 3 years ago

Nice. I worked on something similar but far less robust: https://github.com/eihli/image-table-ocr. It fails to find the tables on the example images at extract-table.com, but the code is heavily commented at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html so there's high visibility into what's going on and what needs to change to get it to work with images of different sizes/fonts.

BrandiATMuhkuh over 3 years ago

This is really awesome. I have tried to solve that many times. I got close, with OpenCV and Azure ML. I have even tried AWS Textract (~2 years ago). But this is the best implementation I have seen so far. Congratulations.

I'm not sure what application you are thinking of. But the reason I'm following this problem is UX. Years ago, I worked on a project where anyone could add product prices to a DB. They did that by typing their receipt (line items) into the DB. The major issue was that the UX was horrible.

With an API like yours, this is super simple. One photo. That's all.

Maybe I'll revisit it as a side project.

BillSaysThis over 3 years ago

Really nice but... wondering how long this will last as a free tool given AWS fees.

whirlwin over 3 years ago

Nice. Fun fact: The third example table is an ordered list of Norway's richest people (according to net worth, I think)

howmayiannoyyou over 3 years ago

Nice job. Actually though, what the world really needs is ML that divines the trend and perhaps indices/values from images of charts.

pveierland over 3 years ago

Neat tool! There appear to be two minor issues in the last example. There is an encoding issue with "ø" characters ("RÃ¸kke"), and a column split appears to be missing between the closely spaced numbers ("33 300 22 700" vs "33 300,22 700"). A possible, though possibly non-trivial, improvement: harmonize formatting within the same column to avoid mixed occurrences of "7800" / "7 800".
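Both issues can be illustrated in a few lines. This is a sketch under two assumptions, not a diagnosis of the tool: that "RÃ¸kke" comes from UTF-8 bytes being decoded as Latin-1 somewhere along the way (which is what that pattern typically indicates), and that a space inside a number is always a thousands separator. The function names are hypothetical.

```python
import re


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake; leave the text untouched.
        return text


def normalize_number(cell: str) -> str:
    """Drop spaces used as thousands separators, e.g. '7 800' -> '7800'."""
    stripped = cell.strip()
    # 1-3 leading digits followed by one or more space-separated 3-digit groups.
    if re.fullmatch(r"\d{1,3}( \d{3})+", stripped):
        return stripped.replace(" ", "")
    return cell


print(fix_mojibake("RÃ¸kke"))  # prints "Røkke"
print([normalize_number(c) for c in ["7800", "7 800", "33 300"]])
# prints ['7800', '7800', '33300']
```

A cell such as "33 300 22 700" does not match the pattern and is returned unchanged, so the missing column split still has to be handled separately.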

mzs over 3 years ago

https://github.com/vegarsti/extract-table

jnsie over 3 years ago

Really cool. I'm interested to hear your plans for this. Are you planning to offer it as a service, open source it, etc.?

visarga over 3 years ago

Does it also do table detection in a larger image and header/body classification?

ducktective over 3 years ago

Awesome project!

Can AWS Textract be used directly with curl to return text strings of an uploaded image?

z3t4 over 3 years ago

Should make it into a browser plugin, so annoying when web sites have tables in images.

basmango over 3 years ago

Does it use Textract directly? Or are you doing some preprocessing?

tuberelay over 3 years ago

UiPath does this in a nice way.