TechEcho

7 comments

atripathi over 14 years ago

Hi, we used PDFTextStream to extract information from PDF documents in much the way you describe (pre-known regions of the document), after looking at a few other options. Working with coordinates and rectangles was not very easy, though :)

We observed that the text in our PDFs had a structure to it. So instead we simply dumped the text from the PDF using pdftotext and wrote an ANTLR grammar for the structure we saw. This enabled us to parse the relevant information from the text dump.
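For anyone who wants to try the dump-then-parse idea without setting up ANTLR first, here is a minimal Python sketch. It swaps the ANTLR grammar for a single regular expression, and the "Key: value" line format is purely an assumption for illustration; the real structure depends on your documents. The function names `dump_text` and `parse_fields` are made up for this example.

```python
import re
import subprocess

# Hypothetical line format ("Key: value"); adjust to the structure you see
# in your own text dumps. This regex stands in for a real grammar.
FIELD_LINE = re.compile(
    r"^\s*(?P<key>[A-Za-z][A-Za-z ]*?)\s*:\s*(?P<value>.+?)\s*$"
)


def dump_text(pdf_path):
    """Run the pdftotext command-line tool and return the text dump.

    "-layout" keeps the physical layout; "-" sends the output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_fields(text):
    """Collect every line that matches the assumed "Key: value" shape."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            fields[match.group("key")] = match.group("value")
    return fields
```

Usage would be something like `parse_fields(dump_text("invoice.pdf"))`, where the file name is a placeholder. Once the structure gets more nested than one field per line, a real grammar is probably the better tool.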

scorpioxy over 14 years ago

I don't know about positional information, but I've had good luck with PDFBox for text extraction. And by good luck I mean as good as it gets, considering that I am using something free and working with the PDF standard.

This was a system used in production, but it had several checks and fallback mechanisms because the process was unreliable.
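The comment does not say what the checks and fallbacks looked like, so the following is only a guess at the general shape, sketched in Python: try several extractors in order and accept the first output that passes a cheap sanity check. The names `looks_plausible` and `extract_with_fallback`, and the thresholds, are assumptions for illustration.

```python
def looks_plausible(text, min_chars=20, min_ok_ratio=0.5):
    """Cheap sanity check: enough text, mostly letters, digits, or whitespace."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    ok = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return ok / len(stripped) >= min_ok_ratio


def extract_with_fallback(pdf_path, extractors, check=looks_plausible):
    """Try each (name, extractor) pair in order.

    Each extractor is any callable taking a path and returning text.
    Returns (name, text) for the first result that passes the check.
    """
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception as exc:  # one failing tool should not stop the chain
            errors.append((name, repr(exc)))
            continue
        if check(text):
            return name, text
        errors.append((name, "output failed sanity check"))
    raise RuntimeError(f"all extractors failed for {pdf_path}: {errors}")
```

The extractors could wrap PDFBox, pdftotext, or anything else; the point is only that each result gets checked before it is trusted.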

silvestrov over 14 years ago

http://www.pdflib.com/products/tet/

TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.

cemerick over 14 years ago

Others have mentioned PDFTextStream (http://snowtide.com), which is our Java and .NET product. Our RegionOutputTarget class (http://snowtide.com/docframe.php/com.snowtide.pdf.RegionOutputTarget) allows you to do selective text extraction based on spatial coordinates quite easily.

If anyone has any questions, feel free to ping me.

iworkforthem over 14 years ago

In Java, there are Apache PDFBox and jPDFText. The nature of PDF makes it very difficult to extract text correctly and consistently.

mgedmin over 14 years ago

I've used pdftohtml -xml from poppler-utils for similar purposes (text with position info; I wasn't interested in images, although I believe pdftohtml handles them too).

Poppler is the library that pdftohtml uses for this.
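To make the "text with position info" part concrete, here is a small Python sketch that reads the XML produced by `pdftohtml -xml` and keeps only the text inside a rectangle. It assumes the output has `<page number="...">` elements containing `<text top="..." left="..." width="..." height="...">` elements, with the origin at the top-left of the page; check this against the output of your poppler version. The function name `text_in_region` is made up for this example.

```python
import xml.etree.ElementTree as ET


def text_in_region(xml_string, page_number, left, top, right, bottom):
    """Return text fragments on one page that lie fully inside a rectangle.

    Assumes pdftohtml -xml style markup: <page number=...> elements holding
    <text top=... left=... width=... height=...> elements.
    """
    root = ET.fromstring(xml_string)
    hits = []
    for page in root.iter("page"):
        if int(page.get("number")) != page_number:
            continue
        for node in page.iter("text"):
            x = int(node.get("left"))
            y = int(node.get("top"))
            w = int(node.get("width"))
            h = int(node.get("height"))
            inside = (
                x >= left and y >= top
                and x + w <= right and y + h <= bottom
            )
            if inside:
                # itertext() also picks up text nested in <b> or <i> tags
                hits.append((y, x, "".join(node.itertext())))
    hits.sort()  # reading order: top to bottom, then left to right
    return [fragment for _, _, fragment in hits]
```

Shrinking the rectangle to leave out the top and bottom of the page is a simple way to skip headers and footers.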

maresca over 14 years ago

PDFSharp is good if you are using .NET: http://www.pdfsharp.net/

Ask HN: Recommendations for PDF text extraction
