科技回声

12 comments

hnriot, almost 12 years ago

Why not just system(pdf2html) - I don't see the point since this level of functionality is trivially achieved. If it did something over and above that it might be useful, like OCR, but even that's not hard to add.

zdw, almost 12 years ago

If you're doing this locally on the command line, use `pdftotext` from http://www.foolabs.com/xpdf/

For OCR, `pdfimages` (also from xpdf), combined with ImageMagick's `convert` and `tesseract` (http://code.google.com/p/tesseract-ocr/), works passably well.
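
A minimal sketch of driving the local tool from a script, assuming the `pdftotext` binary is installed and on the PATH. The helper names `pdftotext_command` and `extract_text` are made up for illustration; `-layout` and the `-` output argument (write to stdout) are `pdftotext` options.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, layout=False):
    """Build the argv list for pdftotext, sending the text to stdout ("-")."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        check=True,
        capture_output=True,
    )
    # Decode leniently; the output encoding depends on the tool's settings.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("contract.pdf", layout=True)` would return the whole document as one string; the file name here is only a placeholder.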

kijin, almost 12 years ago

I have some questions:

1. Why return an array of texts? Where do the texts get split up? At page boundaries? Column boundaries? At the end of each line? If a line is interrupted by a corner of an image and continues a couple of inches afterward, does it get treated as a separate text? (I once used a PDF->text extractor program that spit out every word separately, often in an incorrect order. That probably had to do with how the PDF was organized internally.)

2. "The PDF file should be smaller than 1 Mbit" -> You mean 1 megabyte, right? Because 1 megabit is only 125-128 kilobytes.
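
The 125-128 kilobyte range in the second question follows from whether "mega" and "kilo" are read as decimal or binary prefixes. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def megabits_to_kilobytes(megabits, binary=False):
    """Convert megabits to kilobytes using decimal or binary prefixes."""
    bits_per_megabit = 1024 * 1024 if binary else 1000 * 1000
    bytes_per_kilobyte = 1024 if binary else 1000
    total_bytes = megabits * bits_per_megabit / 8  # 8 bits per byte
    return total_bytes / bytes_per_kilobyte


print(megabits_to_kilobytes(1))               # decimal prefixes: 125.0
print(megabits_to_kilobytes(1, binary=True))  # binary prefixes: 128.0
```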

midas, almost 12 years ago

Going from PDF to nicely formatted word doc would be huge for lawyers and people who do a lot of contract negotiations. It's hard to do well though.

rcfox, almost 12 years ago

I've recently been working on extracting text from PDFs myself. I've found that `pdftohtml -xml` from the Poppler utils does a decent job of it, and includes a bounding box for each piece of text. I've submitted a few patches to their Bugzilla to also include the transformation matrix as well as some extra styling information.
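
A rough sketch of reading those bounding boxes in Python. The sample XML below is hand-written to resemble `pdftohtml -xml` output as I understand it (`page` elements containing `text` elements with `top`, `left`, `width`, and `height` attributes); treat the exact element and attribute names as an assumption and check them against real output. The function name `text_boxes` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape of `pdftohtml -xml` output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="72" width="200" height="16" font="0">Hello <b>world</b></text>
<text top="120" left="72" width="150" height="16" font="0">Second line</text>
</page>
</pdf2xml>"""


def text_boxes(xml_string):
    """Return one dict per text element with its page, bounding box, and text."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            boxes.append({
                "page": page_no,
                "left": int(node.get("left")),
                "top": int(node.get("top")),
                "width": int(node.get("width")),
                "height": int(node.get("height")),
                # itertext() also picks up text inside inline tags such as <b>.
                "text": "".join(node.itertext()),
            })
    return boxes


for box in text_boxes(SAMPLE):
    print(box["page"], box["left"], box["top"], box["text"])
```

Sorting the resulting boxes by `top` and then `left` is one simple way to approximate reading order, though multi-column pages would need more care.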

chenster, almost 12 years ago

I googled "converting PDF to text" and "converting PDF to html". Tons of services already exist out there. Apparently, it's not something new. How do you plan to compete? Are you planning to focus on data extraction rather than conversion?

TillE, almost 12 years ago

Neat, but practically who would want to do this with an API rather than installable software?

ismaelc, almost 12 years ago

Hey, I've documented this in Mashape: https://www.mashape.com/ismaelc/extract-text-from-pdfs#!documentation

surapaneni, almost 12 years ago

This is similar to what we do at http://searchtower.com, where you can store, view, index and search the data.

architgupta, almost 12 years ago

Do you do OCR for text extraction?

ra, almost 12 years ago

Nice. Why no paid options? I'm guessing because this was a weekend project. If so, nice work!

alkou, almost 12 years ago

Do you use pdftotext internally or something else?

Show HN: an API to extract text from a PDF
