Introducing Tabula, a human-friendly PDF-to-CSV data extractor

137 points by mtigas, about 12 years ago

13 comments

cs702, about 12 years ago
Love it: a wonderful gift to millions of students, analysts, journalists, researchers, and others who for many years have had to extract data from PDFs via throwaway scripts, copy-and-paste, or (yikes) read-and-retype.
polskibus, about 12 years ago
If they automate table detection, then many low-end "analysts" will be made redundant. PDFs are one of the worst bits for data feed automation.
danso, about 12 years ago
Great work, the integration (as shown in the demo) and UX are really well done. A couple of questions:

1) Why use Python for OpenCV when Ruby has a decent wrapper that can do Hough (https://github.com/ruby-opencv/ruby-opencv)? Or was the Ruby version just too buggy still?

2) Is there a command-line version planned? I guess it'd be most relevant once auto-detection is figured out.
saddino, about 12 years ago
Wow, nice work! I'm the author of Trapeze, a once-shareware (now freeware and open source) PDF-to-Word/RTF/HTML/PlainText application for OS X. My approach was similar: trying to squash characters into words via a logical grid to determine whitespace. My #1 request from customers was to extract tables and I never had the guts to attempt it. :-)

(For those interested, you can grab Trapeze from mesadynamics.com -- requires OS X 10.4; source code is a mixture of C++ and Objective-C).
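As a rough illustration of squashing characters into words, here is a minimal Python sketch. It is not Trapeze's or Tabula's code: it swaps the logical grid for a simpler horizontal gap threshold, and the `Char` tuple layout, the `chars_to_words` name, and the `gap` default are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical glyph record: (character, x_start, x_end) on a single text line.
Char = Tuple[str, float, float]


def chars_to_words(chars: List[Char], gap: float = 2.0) -> List[str]:
    """Join glyphs into words, splitting wherever the horizontal gap exceeds `gap`."""
    words: List[str] = []
    current = ""
    prev_end: Optional[float] = None
    # Process glyphs left to right regardless of their order in the input.
    for glyph, x_start, x_end in sorted(chars, key=lambda c: c[1]):
        if prev_end is not None and x_start - prev_end > gap:
            # The space between glyphs is wide enough to count as whitespace.
            words.append(current)
            current = ""
        current += glyph
        prev_end = x_end
    if current:
        words.append(current)
    return words
```

A real extractor would probably scale `gap` with the font size instead of using a fixed value.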
xaritas, about 12 years ago
I probably could have used this recently when I had a project which required a close encounter with extracting data from PDFs. Fortunately the PDFs were generated as a report by a VB6 application (!) so they had a fairly regular format once I figured out the quirks of PDF, as the authors describe here.

I did learn a few neat tricks by doing it myself though. The library I used to extract the text was none other than Mozilla's own PDF.js, so in the final version my users could just drag and drop the PDF onto the browser window, and my little algorithm parsed the tables into arrays, with AngularJS rendering them as HTML tables.

Obviously computer-vision assisted, general purpose reconstruction of tabular data is the secret sauce in this project, but if you have the right use case you can do some cool things in the client. You do have to dig into the PDF.js internals a bit to figure out how to use it but I'm sure that it will improve in that respect.
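For regularly formatted reports like these, parsing tables into arrays can be as simple as grouping positioned text fragments by their vertical coordinate. The Python sketch below shows one way that might look; it is not xaritas's algorithm and does not call PDF.js. The `Item` tuple, the `items_to_rows` name, the `y_tol` tolerance, and the assumption that y grows down the page are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical text fragment: (text, x, y), with y assumed to increase down the page.
Item = Tuple[str, float, float]


def items_to_rows(items: List[Item], y_tol: float = 2.0) -> List[List[str]]:
    """Group fragments into table rows by y, then order each row's cells by x."""
    rows: List[List[Item]] = []
    for item in sorted(items, key=lambda it: it[2]):
        # Compare against the first fragment of the current row so that a long
        # chain of small offsets cannot drift into the next row.
        if rows and abs(item[2] - rows[-1][0][2]) <= y_tol:
            rows[-1].append(item)
        else:
            rows.append([item])
    return [[text for text, _, _ in sorted(row, key=lambda it: it[1])] for row in rows]
```

This only works when rows sit on clean baselines; multi-line cells or merged columns need more than a tolerance check.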
manicbovine, about 12 years ago
I wish I'd read this an hour ago, before I wrote a series of terrible awk, perl, and bash scripts to process several thousand inconsistently formatted pdfs.

edit: Nevermind, it wouldn't have helped. I missed the part where automation isn't yet supported. Either way, this looks like a great tool.
nsp, about 12 years ago
This is fantastic, would have saved me dozens of hours as an econ undergraduate.

Semirelated: I used to have a ton of scanned journal articles that I wanted to be able to read on a Kindle without having to scroll across every page, and came across k2pdfopt. It's a C script that finds word and line breaks in image-based PDFs and rearranges the text so that they'll fit on smaller screens. It's got a ton of flags you can set and is pretty good at ignoring/cropping out headers and footers and dealing with pages scanned at an angle. http://www.willus.com/k2pdfopt/help/k2menu.shtml No affiliation with Willus
migbac, about 12 years ago
I am starting a personal project to convert my university schedules from PDF to an ICS calendar. I'm so glad I heard about Tabula, but as previously said, a command-line version would just be wonderful.
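Once schedule rows have been extracted, turning them into an ICS file takes little code. The Python sketch below is only an illustration under stated assumptions: the `Event` tuple, the `to_ics` name, the PRODID string, and the `example.invalid` UID domain are made up for the example, start and end times are written as floating local times, and long lines are not folded.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple

# Hypothetical row shape after extraction: (summary, start, end).
Event = Tuple[str, datetime, datetime]

_FMT = "%Y%m%dT%H%M%S"


def _escape(text: str) -> str:
    """Escape characters that are special in iCalendar TEXT values."""
    return (
        text.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def to_ics(events: Iterable[Event], stamp: Optional[datetime] = None) -> str:
    """Render events as a minimal iCalendar document with CRLF line endings."""
    # `stamp` is assumed to be in UTC; it defaults to the current UTC time.
    stamp = stamp or datetime.now(timezone.utc)
    lines: List[str] = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//schedule-sketch//EN",  # placeholder product id
    ]
    for index, (summary, start, end) in enumerate(events):
        lines += [
            "BEGIN:VEVENT",
            f"UID:{index}-{start.strftime(_FMT)}@example.invalid",
            f"DTSTAMP:{stamp.strftime(_FMT)}Z",
            f"DTSTART:{start.strftime(_FMT)}",
            f"DTEND:{end.strftime(_FMT)}",
            f"SUMMARY:{_escape(summary)}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"
```

Recurring weekly classes would need an RRULE property, which this sketch leaves out.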
stcredzero, about 12 years ago
This is very cool!

Has this kind of thing been done for PDF map data?

I was talking with a friend of mine a month ago about the dismal state of official crime incidence websites. They're usually just lists of PDFs, probably because whoever is responsible for the data just uses whatever MS Word PDF output is available to the office and posts an existing monthly report as a PDF. This makes online crime data a huge pain in the #ss to decipher.

I'm sure there's a lot of geographic data this could apply to.
leeoniya, about 12 years ago
this is neat. i'm also doing pdf rasterization and pretty extensive document analysis in html5 <canvas>, not just tables. unfortunately it's for an internal tool which will likely form the core of our business but the base library i wrote and use for it is open sourced at https://github.com/leeoniya/pXY.js

tute and demos are here: http://o-0.me/pXY/ , some recent commits like radial scanning aren't documented very well yet but i'll devote some time to it if anyone needs those. they're mostly useful for interactive analysis.

with some creative algorithms, typed arrays and web workers the speed is pretty amazing (for something built in js at least). a 1550x2006 pixel document page analyzes in 1.1s in chrome.
alanreid, about 12 years ago
This is just awesome! Well done!
jonjohn84, about 12 years ago
Tabula is also the name of a programmable logic company doing fpga-like "3PLDs" where the design implemented varies over time to increase effective size of the logic fabric. (tabula.com)
bnp, about 12 years ago
Awesome - have needed this so often.