Hello HN, can anyone recommend a library/API for extracting the text and images from a PDF?<p>We need to get at text contained in pre-known regions of the document, so the API will need to give us positional information for each element on the page.<p>Thanks for any suggestions.
Hi,
We used PDFTextStream to extract information from PDF documents in much the way you describe (pre-known regions of the document), after looking at a few other options. Working with coordinates and rectangles was not very easy, though :)<p>We noticed that the text in our PDFs had a structure to it, so instead we simply dumped the text with pdftotext and wrote an ANTLR grammar for the structure we saw. This let us parse the relevant information out of the text dump.
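For anyone wanting to try the dump-then-parse route, here is a minimal sketch. It assumes pdftotext is on your PATH, and it swaps the ANTLR grammar for a plain regex to keep things short. The "Key: value" line shape, `FIELD_RE`, `dump_text` and `parse_fields` are all made up for illustration; your documents will need their own pattern.

```python
import re
import subprocess

# Hypothetical pattern for lines shaped like "Invoice No: 12345".
# This regex stands in for the ANTLR grammar mentioned above.
FIELD_RE = re.compile(r"^\s*(?P<key>[A-Za-z][A-Za-z ]*?)\s*:\s*(?P<value>.+?)\s*$")


def dump_text(pdf_path):
    """Run pdftotext and return the text dump ("-" sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_fields(text):
    """Collect every "Key: value" line of the dump into a dict."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("key")] = match.group("value")
    return fields
```

Usage would be something like `parse_fields(dump_text("invoice.pdf"))`. A regex only goes so far; once the structure nests or spans several lines, a real grammar is probably the better tool.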
I don't know about positional information, but I've had good luck with PDFBox for text extraction. By good luck I mean as good as it gets, considering it is free and has to cope with the PDF standard.<p>It ran in a production system, but with several checks and fallback mechanisms, because the extraction process was unreliable.
<a href="http://www.pdflib.com/products/tet/" rel="nofollow">http://www.pdflib.com/products/tet/</a><p>TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
Others have mentioned PDFTextStream (<a href="http://snowtide.com" rel="nofollow">http://snowtide.com</a>), which is our Java and .NET product. Our RegionOutputTarget class (<a href="http://snowtide.com/docframe.php/com.snowtide.pdf.RegionOutputTarget" rel="nofollow">http://snowtide.com/docframe.php/com.snowtide.pdf.RegionOutp...</a>) allows you to do selective text extraction based on spatial coordinates quite easily.<p>If anyone has any questions, feel free to ping me.
I've used pdftohtml -xml from poppler-utils for similar purposes (text with position info; I wasn't interested in images although I believe pdftohtml handles them too).<p>Poppler is the library that pdftohtml uses for this.
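A rough sketch of how the region lookup might work on that output. It assumes the XML has `page` elements with a `number` attribute and `text` elements carrying `top`, `left`, `width` and `height`, which is what pdftohtml -xml has produced in my experience; check your own output first. The function name `text_in_region` and the box convention (left, top, right, bottom, origin at the top-left of the page) are my own choices, not part of poppler.

```python
import xml.etree.ElementTree as ET


def text_in_region(xml_string, page_number, left, top, right, bottom):
    """Return the text of every <text> element lying fully inside the box.

    Coordinates are in the units used by the XML, with the origin assumed
    to be at the top-left corner of the page.
    """
    root = ET.fromstring(xml_string)
    hits = []
    for page in root.iter("page"):
        if int(page.get("number")) != page_number:
            continue
        for el in page.iter("text"):
            x = float(el.get("left"))
            y = float(el.get("top"))
            w = float(el.get("width"))
            h = float(el.get("height"))
            if x >= left and y >= top and x + w <= right and y + h <= bottom:
                # itertext() also picks up text nested in <b> or <i> children
                hits.append("".join(el.itertext()))
    return hits
```

To get the XML in the first place, something like `pdftohtml -xml input.pdf output` should write an output.xml that you can read in and pass to the function. Requiring elements to sit fully inside the box is a deliberate choice; if your regions are tight, you may prefer an overlap test.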