TechEcho

Ask HN: Recommendations for PDF text extraction

16 points by kenver over 14 years ago
Hello HN, can anyone recommend a library/API for extracting the text and images from a PDF?

We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the page.

Thanks for any suggestions.

7 comments

atripathi over 14 years ago
Hi, we used PdfTextStream to extract information from PDF documents in a similar manner to what you describe (pre-known regions of the document), after looking at a few other options. Working with coordinates and rectangles was not very easy, though :)

We observed that the text in our PDFs had a structure to it. So instead we simply dumped the text from the PDF using pdftotext and wrote an ANTLR grammar for the structure we saw. This enabled us to parse the relevant information from the text dump.
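
For anyone wanting to try the dump-then-parse route, here is a minimal sketch in Python. It swaps the ANTLR grammar for a plain regular expression, and the "Key: value" line structure is purely an assumption for illustration, since the actual structure of those PDFs is not described. `dump_text` and `parse_fields` are hypothetical helper names; `pdftotext` must be installed (it ships with poppler-utils).

```python
import re
import subprocess

# Assumed structure: lines of the form "Key: value". The real layout of the
# PDFs discussed above is not known, so adjust the pattern to your documents.
FIELD_RE = re.compile(r"^(?P<key>[A-Za-z][A-Za-z ]*?)\s*:\s*(?P<value>.+?)\s*$")


def dump_text(pdf_path):
    """Run pdftotext and return the text dump ('-' sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_fields(text):
    """Collect 'Key: value' lines from a text dump into a dict."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group("key")] = match.group("value")
    return fields
```

Typical use would be `parse_fields(dump_text("some.pdf"))`. A regular expression only goes so far; once the structure is nested or spans several lines, a real grammar like the one described above is probably the better tool.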
scorpioxy over 14 years ago
I don't know about positional information, but I've had good luck with PDFBox for text extraction. And by good luck I mean as good as it gets considering I am using something for free and working with the PDF standard.

This was a system used in production but had several checks and fallback mechanisms because the process was unreliable.
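
The checks and fallbacks used in that system are not described, so the following is only a sketch of what such a setup could look like: try several extractors in order and accept the first output that passes a cheap sanity check. The function names, the 20-character minimum, and the 90% printable threshold are all assumptions made for illustration.

```python
def looks_plausible(text, min_chars=20):
    """Cheap sanity check: enough text, and mostly printable characters."""
    if text is None or len(text.strip()) < min_chars:
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / len(text) >= 0.9


def extract_with_fallback(pdf_path, extractors):
    """Try each (name, callable) pair in order.

    Returns (name, text) for the first extractor whose output passes the
    sanity check; raises RuntimeError if none of them does.
    """
    errors = []
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as exc:  # a crashing extractor is an expected failure here
            errors.append((name, repr(exc)))
            continue
        if looks_plausible(text):
            return name, text
        errors.append((name, "output failed sanity check"))
    raise RuntimeError(f"all extractors failed for {pdf_path}: {errors}")
```

Each extractor is just a callable that takes a path and returns a string, so a PDFBox wrapper, a pdftotext wrapper, or an OCR step can all be plugged into the same list.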
silvestrov over 14 years ago
http://www.pdflib.com/products/tet/

TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
cemerick over 14 years ago
Others have mentioned PDFTextStream (http://snowtide.com), which is our Java and .NET product. Our RegionOutputTarget class (http://snowtide.com/docframe.php/com.snowtide.pdf.RegionOutputTarget) allows you to do selective text extraction based on spatial coordinates quite easily.

If anyone has any questions, feel free to ping me.
iworkforthem over 14 years ago
In Java, there are Apache PDFBox and jPDFText. The nature of PDF makes it very difficult to extract text correctly and consistently.
mgedmin over 14 years ago
I've used pdftohtml -xml from poppler-utils for similar purposes (text with position info; I wasn't interested in images although I believe pdftohtml handles them too).

Poppler is the library that pdftohtml uses for this.
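
To connect this to the original question about pre-known regions, here is a rough Python sketch that filters the XML by bounding box. It assumes the output consists of `<page number="...">` elements containing `<text top="..." left="..." width="..." height="...">` elements, with coordinates in the page units pdftohtml reports; check that against the XML your version actually writes. `text_in_region` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def text_in_region(xml_source, page_number, left, top, right, bottom):
    """Return the text fragments that lie fully inside a rectangle on a page.

    xml_source is the XML document as a string. The element and attribute
    names (page/number, text/top/left/width/height) are assumptions about
    the pdftohtml -xml output format.
    """
    root = ET.fromstring(xml_source)
    hits = []
    for page in root.iter("page"):
        if int(page.get("number")) != page_number:
            continue
        for node in page.iter("text"):
            x = float(node.get("left"))
            y = float(node.get("top"))
            w = float(node.get("width"))
            h = float(node.get("height"))
            if x >= left and y >= top and x + w <= right and y + h <= bottom:
                # itertext() also picks up text nested in <b>, <i>, or <a>
                hits.append("".join(node.itertext()))
    return hits
```

Fragments that straddle the rectangle are dropped here; switching to an overlap test would be the natural change if the known regions are only approximate.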
maresca over 14 years ago
PDFSharp is good if you are using .NET

http://www.pdfsharp.net/