TechEcho

12 comments

hnriot almost 12 years ago

Why not just system(pdf2html) - I don't see the point since this level of functionality is trivially achieved. If it did something over and above that it might be useful, like OCR, but even that's not hard to add.

zdw almost 12 years ago

If you're doing this local/cli:

`pdftotext`, from http://www.foolabs.com/xpdf/

For OCR, `pdfimages` (also from xpdf), combined with ImageMagick's `convert`, and `tesseract` (http://code.google.com/p/tesseract-ocr/) works passably well.
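A minimal sketch of the local pipeline described above, in Python. It only builds the command lines for `pdftotext` and for the `pdfimages`, `convert`, `tesseract` OCR chain, using the basic `input output` argument form of each tool. The helper names, the `page` image root, and the choice of TIFF as the intermediate format are assumptions made for illustration; `run_ocr` needs xpdf, ImageMagick and tesseract on the PATH.

```python
from pathlib import Path
import subprocess


def text_layer_command(pdf, out_txt):
    """pdftotext INPUT.pdf OUTPUT.txt: extract the embedded text layer."""
    return ["pdftotext", str(pdf), str(out_txt)]


def image_dump_command(pdf, image_root):
    """pdfimages INPUT.pdf ROOT: write embedded images as ROOT-NNN.ppm/.pbm."""
    return ["pdfimages", str(pdf), str(image_root)]


def ocr_commands(image):
    """Convert one extracted image to TIFF, then OCR it with tesseract."""
    image = Path(image)
    tiff = image.with_suffix(".tif")
    out_base = image.with_suffix("")  # tesseract appends ".txt" itself
    return [
        ["convert", str(image), str(tiff)],
        ["tesseract", str(tiff), str(out_base)],
    ]


def run_ocr(pdf, workdir):
    """Run the whole chain (hypothetical helper; needs the tools installed)."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(image_dump_command(pdf, workdir / "page"), check=True)
    for image in sorted(workdir.glob("page-*.p?m")):
        for cmd in ocr_commands(image):
            subprocess.run(cmd, check=True)
    return sorted(workdir.glob("page-*.txt"))
```

For a scanned PDF with no text layer, `run_ocr("scan.pdf", "work")` would leave one `.txt` file per extracted image in `work/`; for a PDF that already has a text layer, `text_layer_command` alone is enough.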

kijin almost 12 years ago

I have some questions:

1. Why return an array of texts? Where do the texts get split up? At page boundaries? Column boundaries? At the end of each line? If a line is interrupted by a corner of an image and continues a couple of inches afterward, does it get treated as a separate text? (I once used a PDF->text extractor program that spit out every word separately, often in an incorrect order. That probably had to do with how the PDF was organized internally.)

2. "The PDF file should be smaller than 1 Mbit" -> You mean 1 megabyte, right? Because 1 megabit is only 125-128 kilobytes.
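The 125-128 kilobyte range in the second question comes from the two common meanings of "mega". A small Python check of the arithmetic (the function name and the `binary` flag are just for illustration):

```python
BITS_PER_BYTE = 8


def megabits_to_bytes(megabits, binary=False):
    """Convert megabits to bytes using decimal (10**6) or binary (2**20) mega."""
    bits_per_megabit = 2**20 if binary else 10**6
    return megabits * bits_per_megabit // BITS_PER_BYTE


decimal_bytes = megabits_to_bytes(1)              # 125000 bytes
binary_bytes = megabits_to_bytes(1, binary=True)  # 131072 bytes

print(decimal_bytes / 1000, "kB")   # 125.0 kB
print(binary_bytes / 1024, "KiB")   # 128.0 KiB
```

Either way, a 1 megabit limit is about one eighth of a 1 megabyte limit.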

midas almost 12 years ago

Going from PDF to nicely formatted word doc would be huge for lawyers and people who do a lot of contract negotiations. It's hard to do well though.

rcfox almost 12 years ago

I've recently been working on extracting text from PDFs myself. I've found that `pdftohtml -xml` from the Poppler utils does a decent job of it, and includes a bounding box for each piece of text. I've submitted a few patches to their Bugzilla to also include the transformation matrix as well as some extra styling information.
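A rough sketch of reading that output in Python. The `SAMPLE` string is hand-written to resemble what `pdftohtml -xml` typically emits (`<page>` elements containing `<text>` elements with `top`, `left`, `width` and `height` attributes); treat the exact attribute set as an assumption and check it against your own output. The function name `text_boxes` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, not real tool output.
SAMPLE = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="72" width="200" height="14" font="0">Hello <b>world</b></text>
    <text top="120" left="72" width="150" height="14" font="0">Second line</text>
  </page>
</pdf2xml>"""


def text_boxes(xml_string):
    """Return one dict per <text> element with its page, bounding box and text."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "left": int(el.get("left")),
                "top": int(el.get("top")),
                "width": int(el.get("width")),
                "height": int(el.get("height")),
                # itertext() also picks up text inside nested <b>/<i> tags
                "text": "".join(el.itertext()),
            })
    return boxes


for box in text_boxes(SAMPLE):
    print(box["page"], box["left"], box["top"], box["text"])
```

Sorting the boxes by `top` and then `left` is a simple first step toward restoring reading order, though it will not handle multi-column layouts on its own.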

chenster almost 12 years ago

I googled "converting PDF to text" and "converting PDF to html". Tons of services already exist out there. Apparently, it's not something new. How do you plan to compete? Are you planning to focus on data extraction rather than conversion?

TillE almost 12 years ago

Neat, but practically who would want to do this with an API rather than installable software?

ismaelc almost 12 years ago

Hey, I've documented this in Mashape - https://www.mashape.com/ismaelc/extract-text-from-pdfs#!documentation

surapaneni almost 12 years ago

This is similar to what we do at http://searchtower.com, where you can store, view, index and search the data.

architgupta almost 12 years ago

Do you do OCR for text extraction?

ra almost 12 years ago

Nice. Why no paid options? I'm guessing because this was a weekend project. If so, nice work!

alkou almost 12 years ago

Do you use pdftotext internally or something else?

Show HN: an API to extract text from a PDF
