科技回声

7 comments

brandnewlow, about 16 years ago

If anyone wants to use this for something public-service oriented:

Chicago is running for the 2016 Olympic games. About a month ago they released their official "bid book" in PDF form. The local papers gave it a look and wrote some fine stories, but a bunch of local journalists (myself among them) would like to extract the thing out into a wiki so people could discuss and annotate it instead of just reading it in PDF form.

Link to the bid book: http://www.chicago2016.org/our-plan/bid-book/bid-book.aspx

We were thinking of using MediaWiki as the wiki engine. One of us is currently running (the excellent) Chicago Elections Wiki over at http://chicagoelections.pbwiki.com/

We'd host, promote, annotate and fill out the wiki; the important thing is to move this from a PDF to an interactive, scannable, hypertext format so people can tear it apart.

We'd been talking about sneaking into PyCon and asking around if anyone there would be interested in working on this. It looks like this PDF miner is the start of something that could do this.

bd, about 16 years ago

I used it recently for analysis of PDF articles.

It's quite good, though as it is written in pure Python, it's rather slow (especially compared to command line tools written in C/C++).

I strongly recommend using Psyco [1]. Adding a few lines of code cut my PDF->HTML conversion times by half.

Also, be warned that the markup it produces can be very heavy. Depending on how the PDF is structured, you can end up with a huge number of DOM elements.

-----

[1] http://psyco.sourceforge.net/
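For anyone wondering what "a few lines" might look like, here is a minimal sketch, not bd's actual code. It assumes Psyco is installed and uses `psyco.full()`, which asks Psyco to compile every function it can. The `enable_psyco` helper name is made up for illustration. Psyco only ever targeted Python 2, so the import is guarded and the script still runs without it.

```python
def enable_psyco():
    """Try to switch on Psyco for the whole program; return True on success.

    Hypothetical helper for illustration, not part of PDF Miner.
    """
    try:
        import psyco  # assumption: Psyco is installed (Python 2 only)
    except ImportError:
        # Psyco is not available: keep running, just without the speedup.
        return False
    psyco.full()  # ask Psyco to compile every function it can
    return True


if __name__ == "__main__":
    # Call this once at startup, before the slow conversion work begins.
    print("Psyco enabled:", enable_psyco())
```

Put the call at the top of whatever script drives the conversion. Whether you see the same halving of run time will depend on the documents and the interpreter.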

latortuga, about 16 years ago

For our startup we had a huge integration project with an industry-specific PDF and so I ended up writing a PDF importer that sounds like it does something similar to this project. The best part is that I couldn't figure out how to get my reader to determine what page a specific set of coordinates was on and it looks like this library supports it - thanks for the link!

jpcx01, about 16 years ago

Looks interesting. Any good Ruby alternatives?

albertsun, about 16 years ago

Nice stuff. So many public documents are released in PDF format instead of an easy-to-work-with plain text format.

mahmud, about 16 years ago

Does anyone know if something like this exists for C? It would be nice to be able to call it from $LANGUAGE.

sketerpot, about 16 years ago

AAAAAAGHFGUREH!!! I had to write my own a few months ago, which sucked. If I had known about this, I could have been saved a lot of effort. Noooooo!

Technology moves forward, I see.

PDF Miner
