Show HN: Extract Table from Image

217 points by v3gas over 3 years ago

17 comments

w-m over 3 years ago
I'm answering questions about Pandas (the Python data analysis framework) on StackOverflow from time to time. It's an exercise in patience, because many people will post screenshots of their data instead of a reproducible code example. You'll have to point about every other newcomer to the documentation on how to write a proper question that one can actually answer.

I'd imagine other areas around StackOverflow (SQL, R?) are fighting similar issues. I've just tried it with a question (sure enough the second newest Pandas tagged question had a table as an image), and your tool produced a nice .csv.

It would be a godsend to have a button on StackOverflow that would replace a user-uploaded image of a table with some Pandas code that constructs the same DataFrame. Currently I would have to download the image, upload it to extract-table.com, download the .csv, load it into Python, and run some code to create the code-based DataFrame.

I'd consider sending people on StackOverflow to your tool if you cut down some of the steps: (1) allowing users to paste in the URL of an image, and (2) producing Pandas code output that can be directly copy/pasted from the site (not having to download a csv).

For illustration: here's what the Pandas code would look like for the first example of extract-table.com:

    import pandas as pd

    df = pd.DataFrame(
        {'Name': {0: 'David', 1: 'Jessica', 2: 'Warren'},
         'Gender': {0: 'Male', 1: 'Female', 2: 'Male'},
         'Age': {0: 23, 1: 47, 2: 12}}
    )
greaterweb over 3 years ago
Nice work putting together this tool. Have you seen either Spark OCR[1] from John Snow Labs or the Adobe PDF Extract API[2]? They both do a pretty good job of data extraction from tables as well.

[1] https://www.johnsnowlabs.com/spark-ocr/

[2] https://www.adobe.io/apis/documentcloud/dcsdk/pdf-extract.html
MattGaiser over 3 years ago
Pair this with a snipping tool and all sorts of people in banking would use it for a few hours a day, especially if it could paste to Excel or at least fill the clipboard in a way pastable to Excel.

I used to work for a bank on their innovation team and pitched basically this, but as an intern I had neither the skill nor time to do it. But it was certainly something a bunch of people internally wanted.
nanis over 3 years ago
With this image[1] from this question on SO[2], the output[3] is missing the last row. FWIW, I've had the occasional miraculous-looking results from AWS Textract, but you do need to keep an eye on what's happening.

Update: I just checked a bit more carefully, and this example[4] is also missing the last row.

Also, Danish ø seems problematic on your web page whereas the CSV has the right UTF-8 encoded bytes.

[1]: https://i.stack.imgur.com/y7Zrt.png

[2]: https://stackoverflow.com/q/69363708/100754

[3]: https://results.extract-table.com/8d4818867ad604792819e98808ca447d2e1d33b3f69817a475a2d05c7a932e8e

[4]: https://results.extract-table.com/254d95722a2c2b1df72fc26b59925ef94d5c91017a661a194d76f1a52e228634
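
One common way for correct UTF-8 bytes to show up wrong on a page is that they get decoded with a single-byte charset such as Latin-1. Whether that is what happened on extract-table.com is an assumption; the thread does not say. A minimal sketch of the effect, using the name "Røkke" from the third example table:

```python
name = "Røkke"

# The CSV side: correct UTF-8 bytes, where "ø" becomes the two bytes C3 B8.
utf8_bytes = name.encode("utf-8")

# The (assumed) page side: the same bytes read as Latin-1,
# so each of the two bytes turns into its own character.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible as long as no bytes were lost along the way.
recovered = garbled.encode("latin-1").decode("utf-8")

print(garbled)    # RÃ¸kke
print(recovered)  # Røkke
```

If the page really did this, declaring the charset as UTF-8 when serving the result would be the usual fix.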
eihli over 3 years ago
Nice. I worked on something similar but far less robust: https://github.com/eihli/image-table-ocr. It fails to find the tables on the example images at extract-table.com, but the code is heavily commented at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html so there's high visibility into what's going on and what needs to change to get it to work with images of different sizes/fonts.
BrandiATMuhkuh over 3 years ago
This is really awesome. I have tried to solve that many times. I got close, with OpenCV and Azure ML. I have even tried AWS Textract (~2 years ago). But this is the best implementation I have seen so far. Congratulations.

I'm not sure what application you are thinking of. But the reason I'm following this problem is UX. Years ago, I worked on a project where anyone could add product prices into a DB. They did that by typing their receipt (line items) into the DB. The major issue was that the UX was horrible.

With an API like yours, this is super simple. One photo. That's all.

Maybe I'll revisit it as a side project.
BillSaysThis over 3 years ago
Really nice but... wondering how long this will last as a free tool given AWS fees.
whirlwin over 3 years ago
Nice. Fun fact: The third example table is an ordered list of Norway's richest people (according to net worth, I think)
howmayiannoyyou over 3 years ago
Nice job. Actually though, what the world really needs is ML that divines the trend and perhaps indices/values from images of charts.
pveierland over 3 years ago
Neat tool! There appear to be two minor issues in the last example. There is an encoding issue with "ø" characters ("Røkke"), and a column split appears to be missing between the closely spaced numbers ("33 300 22 700" vs "33 300,22 700"). Possible, though possibly non-trivial, improvement: harmonize formatting within the same column to avoid mixed occurrences of "7800" / "7 800".
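
The harmonization suggested above could look something like the sketch below. It is an illustration only, not part of the tool: the helper name `harmonize_number` is hypothetical, and it assumes that a space (or non-breaking space) inside a cell is a thousands separator only when every group after the first has exactly three digits.

```python
import re

# One to three leading digits, then one or more space-separated groups
# of exactly three digits (regular space or non-breaking space).
_GROUPED_NUMBER = re.compile(r"^\d{1,3}(?:[ \u00a0]\d{3})+$")


def harmonize_number(cell: str) -> str:
    """Drop thousands-separator spaces so "7 800" and "7800" look the same.

    Hypothetical helper: cells that do not look like a grouped number
    are returned unchanged (apart from surrounding whitespace).
    """
    cell = cell.strip()
    if _GROUPED_NUMBER.match(cell):
        return re.sub(r"[ \u00a0]", "", cell)
    return cell


for raw in ["7 800", "7800", "33 300", "33 300 22 700", "Røkke"]:
    print(repr(raw), "->", repr(harmonize_number(raw)))
```

Note that "33 300 22 700" is left untouched because "22" is not a three-digit group. A cell that fails the pattern this way may itself be a hint that a column split was missed.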
mzs over 3 years ago
https://github.com/vegarsti/extract-table
jnsie over 3 years ago
Really cool. I'm interested to hear your plans for this. Are you planning to offer it as a service/open source/etc.?
visarga over 3 years ago
Does it also do table detection in a larger image and header/body classification?
ducktective over 3 years ago
Awesome project!

Can AWS Textract be used directly with curl to return text strings of an uploaded image?
z3t4 over 3 years ago
Should make it into a browser plugin, so annoying when web sites have tables in images.
basmango over 3 years ago
Does it use Textract directly? Or are you doing some preprocessing?
tuberelay over 3 years ago
UiPath does this in a nice way