Show HN: Convert PDF files into structured data

107 points by chezmo almost 9 years ago

14 comments

phonon almost 9 years ago
Is this using something like https://github.com/creatale/node-fv on the backend, which can turn imperfectly scanned forms into data once you prepare a schema? Or is it a more simplistic "mark hotspots" approach, which won't work well (or at all) if the scan is not perfectly aligned and sized to match the original?
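To illustrate the alignment concern raised above, here is a minimal sketch of fixed-region ("hotspot") extraction. The `Word` type, the coordinates, and the `dx`/`dy` offset parameters are hypothetical; nothing here reflects how the submitted product or node-fv actually works.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # top edge of the word on the page


# A hotspot is a fixed rectangle: (left, top, right, bottom).
Region = Tuple[float, float, float, float]


def words_in_region(words: List[Word], region: Region,
                    dx: float = 0.0, dy: float = 0.0) -> List[str]:
    """Return the words whose position falls inside the region.

    dx and dy are an optional page offset; each word is shifted back by
    that amount before the comparison.
    """
    left, top, right, bottom = region
    return [
        w.text
        for w in words
        if left <= w.x - dx <= right and top <= w.y - dy <= bottom
    ]


invoice_number_region: Region = (380, 40, 500, 60)

clean_scan = [Word("INV-001", 400, 50)]
shifted_scan = [Word("INV-001", 430, 75)]  # same page, scanned off-center

print(words_in_region(clean_scan, invoice_number_region))
print(words_in_region(shifted_scan, invoice_number_region))
print(words_in_region(shifted_scan, invoice_number_region, dx=30, dy=25))
```

Without a correction step the shifted scan yields nothing, which is the failure mode the question is about. Estimating `dx` and `dy` (for example from a known anchor on the page) is a separate problem not covered here.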
darklajid almost 9 years ago
I'm working for a company that does DMS Things™, and processing incoming PDFs (for mailroom applications or invoice processing) is one of our core projects. Given that this is the closest submission to my day job ever, I'm really curious about your project.
Your online presentation looks great. The 'layout designer', if you will, the 'where are important things' screens look slick.
I do wonder how you assign those settings to incoming PDFs though. Is it the user's responsibility to say 'This PDF? I told you how/from where to extract data before'? Or do you have some classification system that stuffs the PDFs into buckets (say, by vendor), with templates assigned to those?
How many of the PDFs that you encounter contain text (vs. scanned/image-only documents)? For us, while the former are certainly rising in popularity, the latter are still far too common, even more prevalent.
Our solution is mostly on-premise so far (online offerings are the current focus of development) and we're quite OCR heavy, using a bunch of non-free engines and voting between the results. We also have dynamic templates, allowing rule sets containing rules like 'The total amount is a number satisfying format X, usually right of or below a string containing "Total"' (and our invoice processing solution basically comes with rules like these preconfigured for various countries).
Are your templates using absolute coordinates/regions? You mention your 'unpaper' feature - do you fix/deskew both images and regions for misaligned pages?
(I won't mention any company/product names, because I don't want to advertise or hijack the thread. Nor do I need to connect my HN account ~directly~ with my employer.)
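The "dynamic template" rule quoted above can be sketched roughly as follows. This is an illustration only: the `Word` type, the amount pattern standing in for "format X", and the tolerance values are assumptions, not the commenter's product or any real engine's API.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    x: float  # left edge
    y: float  # top edge, increasing down the page


# Assumed stand-in for "format X": digits, optional thousands groups,
# then a decimal separator and exactly two digits.
AMOUNT_RE = re.compile(r"^\d+(?:[.,]\d{3})*[.,]\d{2}$")


def find_total(words: List[Word], line_tol: float = 5.0,
               col_tol: float = 40.0) -> Optional[str]:
    """Find an amount right of, or below, a word containing 'total'."""
    anchors = [w for w in words if "total" in w.text.lower()]
    for anchor in anchors:
        # Prefer an amount on the same line, to the right of the anchor.
        right = [
            w for w in words
            if abs(w.y - anchor.y) <= line_tol
            and w.x > anchor.x
            and AMOUNT_RE.match(w.text)
        ]
        if right:
            return min(right, key=lambda w: w.x - anchor.x).text
        # Otherwise look for an amount below it, in roughly the same column.
        below = [
            w for w in words
            if w.y > anchor.y
            and abs(w.x - anchor.x) <= col_tol
            and AMOUNT_RE.match(w.text)
        ]
        if below:
            return min(below, key=lambda w: w.y - anchor.y).text
    return None


same_line = [Word("Invoice", 10, 10), Word("Total:", 10, 100),
             Word("1,234.56", 200, 101)]
stacked = [Word("Total", 300, 50), Word("99.00", 310, 70)]

print(find_total(same_line))
print(find_total(stacked))
```

Because the rule is relative to an anchor word rather than to absolute page coordinates, it tolerates layout shifts between vendors. Note that a plain substring match also accepts "Subtotal"; a real rule set would need to rank or exclude such anchors.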
evolve2k almost 9 years ago
Looks very cool, nice work.
In your FAQ it says:
There are no special requirements. There is nothing to install and you don't need any technical know-how for setting up and using >>> mailparser.io. <<< No coding is required.
Just pointing out a potential syntax error. Otherwise, if it's meant to say mailparser, better explain what that is.
caseyf7 almost 9 years ago
The Zapier integration is why I'm going to try this one.
unfortunateface almost 9 years ago
Save yourself a lot of support time/costs and remove the 'free' option. Your homepage sells the product well and shows its benefits. From the feedback you've already received it looks like you are providing more than $50 worth of value.
sixhobbits almost 9 years ago
I'm always surprised by how well `pdf2text --layout` works for even complicated looking PDFs. Has been better than most specialised (free) web services I've tried.
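A minimal sketch of calling this from a script, assuming the command meant above is Poppler's `pdftotext` with its `-layout` option (the comment's exact spelling, `pdf2text --layout`, is left as written). The helper names and file paths are hypothetical.

```python
import shutil
import subprocess
from typing import List


def layout_command(pdf_path: str, txt_path: str) -> List[str]:
    """Build the pdftotext call; -layout keeps the physical page layout."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def extract_text(pdf_path: str, txt_path: str) -> bool:
    """Run pdftotext if it is installed; return False if it is not found."""
    if shutil.which("pdftotext") is None:
        return False
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(layout_command(pdf_path, txt_path), check=True)
    return True


print(layout_command("invoice.pdf", "invoice.txt"))
```

With `-layout`, columns in the PDF tend to stay aligned in the text output, which makes tables easier to split afterwards with ordinary string handling.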
frabcus almost 9 years ago
Looks really good!
A quick advert for PDF Tables https://pdftables.com/ - we're a bit more API-focussed.
petra almost 9 years ago
Depending on how well this works, this could be extremely useful for the electronics industry, where everything is locked in a PDF - allowing someone to build an in-depth research tool that would let engineers find the optimal part (using complex queries), from any manufacturer, very fast - far from the broken situation of today, where engineers spend tons of time researching and often don't get close to the ideal.
Kinnard almost 9 years ago
I wonder how their software works. I think there's untapped potential in Adobe's PostScript.
camel_Snake almost 9 years ago
Tried giving this[0] a shot but even just a single page was too large for the 4MB limit.
[0] https://archive.org/details/averageweightofm41fult
jamiecarruthers almost 9 years ago
I gave it a go and couldn't get useful data extracted. I sent a support query with attached PDFs.
markdown almost 9 years ago
Your pricing tables mention webhooks but the FAQs below them don't explain what those are.
ruler88 almost 9 years ago
Nice! I wish I had known about this earlier; I built a version of this on my own to solve this very problem.
mordae almost 9 years ago
No source? No, thanks!