The PDF parser is a rule-based parser that uses text coordinates (bounding boxes), graphics, and font data. It works off the text layer and also offers an OCR option that is applied automatically when your PDFs contain scanned pages. The OCR feature is based on a modified version of Tika, which uses Tesseract underneath.<p>The PDF parser offers the following features:<p>* Sections and subsections along with their levels.
* Paragraphs: lines are combined into paragraphs.
* Links between sections and paragraphs.
* Tables along with the section the tables are found in.
* Lists and nested lists.
* Join content spread across pages.
* Removal of repeating headers and footers.
* Watermark removal.
* OCR with bounding boxes.
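To make one of the rules above concrete, here is a minimal sketch of how repeating headers and footers could be removed. It is not the parser's actual implementation: the function names, the `edge_lines` and `min_ratio` parameters, and the sample pages are all made up for illustration, and it assumes each page is already available as a list of text lines.

```python
from collections import Counter
import re


def normalize(line: str) -> str:
    """Collapse digits and whitespace so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_repeated_edges(pages, edge_lines=2, min_ratio=0.6):
    """Drop lines near the top or bottom of a page that recur on most pages.

    pages: list of pages, each a list of text lines in reading order.
    edge_lines: how many lines at each edge are header/footer candidates.
    min_ratio: fraction of pages a line must appear on to count as repeating.
    """
    counts = Counter()
    for page in pages:
        edges = page[:edge_lines] + page[-edge_lines:]
        # Use a set so a line is counted at most once per page.
        counts.update({normalize(line) for line in edges})

    threshold = max(2, int(min_ratio * len(pages)))
    repeating = {text for text, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = []
        for i, line in enumerate(page):
            at_edge = i < edge_lines or i >= len(page) - edge_lines
            if at_edge and normalize(line) in repeating:
                continue
            kept.append(line)
        cleaned.append(kept)
    return cleaned


# Hypothetical sample data: three short pages with a shared header and footer.
pages = [
    ["ACME Annual Report", "Introduction", "Revenue grew.", "Page 1"],
    ["ACME Annual Report", "Costs fell.", "Margins improved.", "Page 2"],
    ["ACME Annual Report", "Outlook", "Guidance is unchanged.", "Page 3"],
]
for page in strip_repeated_edges(pages):
    print(page)
```

A real rule-based parser would likely also use the line coordinates and font data mentioned above rather than text alone; this sketch only shows the frequency-counting idea.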
One additional library to add, if you're working with scientific papers: <a href="https://github.com/kermitt2/grobid">https://github.com/kermitt2/grobid</a>. I use this with paperetl (<a href="https://github.com/neuml/paperetl">https://github.com/neuml/paperetl</a>).
Nice project! I've long used Tika for document parsing given its maturity and the wide range of formats it supports. The XHTML output helps with chunking documents for RAG.<p>Here are a couple of examples:<p>- <a href="https://neuml.hashnode.dev/build-rag-pipelines-with-txtai" rel="nofollow">https://neuml.hashnode.dev/build-rag-pipelines-with-txtai</a><p>- <a href="https://neuml.hashnode.dev/extract-text-from-documents" rel="nofollow">https://neuml.hashnode.dev/extract-text-from-documents</a><p>Disclaimer: I'm the primary author of txtai (<a href="https://github.com/neuml/txtai">https://github.com/neuml/txtai</a>).
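As a rough illustration of why XHTML output helps with chunking, the sketch below groups paragraph text under the nearest preceding heading using only the Python standard library. It is not how Tika or txtai do it: the `HeadingChunker` class, the `chunk_xhtml` function, and the sample markup are hypothetical, and it assumes the XHTML marks sections with `h1` to `h6` tags and body text with `p` tags, which depends on the source format.

```python
from html.parser import HTMLParser


class HeadingChunker(HTMLParser):
    """Group paragraph text under the most recent h1-h6 heading."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.chunks = []     # list of {"heading": str, "text": str}
        self._tag = None     # heading or paragraph tag currently being read
        self._buffer = []    # text fragments collected for that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "p":
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buffer).split())
        if tag in self.HEADINGS:
            # A heading starts a new chunk.
            self.chunks.append({"heading": text, "text": ""})
        elif text:
            # Paragraphs before any heading go into an untitled chunk.
            if not self.chunks:
                self.chunks.append({"heading": "", "text": ""})
            current = self.chunks[-1]
            current["text"] = (current["text"] + " " + text).strip()
        self._tag = None


def chunk_xhtml(xhtml: str):
    parser = HeadingChunker()
    parser.feed(xhtml)
    parser.close()
    return parser.chunks


# Hypothetical sample markup standing in for a parser's XHTML output.
sample = """<html><body>
<h1>Setup</h1><p>Install the package.</p><p>Then start the server.</p>
<h1>Usage</h1><p>Send a PDF to the endpoint.</p>
</body></html>"""
for chunk in chunk_xhtml(sample):
    print(chunk)
```

Each chunk keeps its heading next to its text, so the heading can be included as context when the chunk is embedded or passed to an LLM.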
This looks like it could be very helpful. The company I work for has a PDF comparison tool called "PDFC" that reads PDFs and compares them for semantic differences. <a href="https://www.inetsoftware.de/products/pdf-content-comparer" rel="nofollow">https://www.inetsoftware.de/products/pdf-content-comparer</a><p>Parsing PDFs can be quite the headache because the format is so complex. We support most of these features already, but there are always so many edge cases that additional angles can be very helpful.
Tesseract OCR fallback sounds great!<p>There are now a lot of file loaders for RAG (LangChain, LlamaIndex, unstructured, ...). Is there any reason, such as a leading benchmark score, to prefer this one?
Great effort and very interesting. However, I go to GitHub and I see "This organization has no public members". I do not know who you are at all, or what else might be part of this without disclosure.<p>Overall, I believe there has to be some middle ground for identification and trust building over time, between "hidden group with no names on $CORP secure site" and other traditional means of introduction and trust building.<p>Thanks for posting this interesting and relevant work.
Thanks for the post. Please use this server with the llmsherpa LayoutPDFReader to get optimal chunks for your LLM/RAG project: <a href="https://github.com/nlmatics/llmsherpa">https://github.com/nlmatics/llmsherpa</a>. See examples and notebook in the repo.
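For anyone who wants a starting point, here is a minimal sketch of reading chunks through llmsherpa against a self-hosted server. The host, port, and endpoint path in `parser_url` are assumptions about a local deployment and should be checked against the repo; `parser_url`, `read_chunks`, and the `report.pdf` file name are hypothetical helpers for illustration.

```python
def parser_url(host="localhost", port=5010):
    """Build the parse endpoint URL for a self-hosted server.

    The default host, port, and path are assumptions; adjust them to
    match how your server is actually exposed.
    """
    return f"http://{host}:{port}/api/parseDocument?renderFormat=all"


def read_chunks(pdf_path_or_url, api_url=None):
    """Return the context text of each chunk llmsherpa produces for a PDF."""
    # Imported lazily so this module loads even without llmsherpa installed.
    from llmsherpa.readers import LayoutPDFReader

    reader = LayoutPDFReader(api_url or parser_url())
    doc = reader.read_pdf(pdf_path_or_url)
    return [chunk.to_context_text() for chunk in doc.chunks()]


# Example call (needs `pip install llmsherpa` and a running server):
# for text in read_chunks("report.pdf"):
#     print(text)
```

The notebook in the repo remains the authoritative reference for the reader's options.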