Show HN: Open-source Rule-based PDF parser for RAG

293 points by jnathsf over 1 year ago
The PDF parser is a rule-based parser which uses text coordinates (bounding boxes), graphics and font data. The PDF parser works off the text layer and also offers an OCR option to automatically use OCR if there are scanned pages in your PDFs. The OCR feature is based on a modified version of Tika, which uses Tesseract underneath.

The PDF parser offers the following features:

* Sections and subsections along with their levels.
* Paragraphs - combines lines.
* Links between sections and paragraphs.
* Tables along with the section the tables are found in.
* Lists and nested lists.
* Join content spread across pages.
* Removal of repeating headers and footers.
* Watermark removal.
* OCR with bounding boxes.
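To give a feel for what a rule of this kind can look like, here is a minimal sketch of repeating header/footer removal. It is not the project's implementation: the function name `strip_repeating_lines`, the `edge` and `min_share` parameters, and the digit-normalization trick are all assumptions made for illustration, and it works on plain lists of text lines rather than on bounding boxes.

```python
import re
from collections import Counter


def _key(text):
    # Normalize digits so "Page 1" and "Page 2" compare as the same line.
    return re.sub(r"\d+", "#", text.strip())


def strip_repeating_lines(pages, edge=1, min_share=0.6):
    """Drop lines near the top/bottom of pages that repeat on most pages.

    pages: list of pages, each a list of text lines in reading order.
    edge: number of lines to inspect at each end of a page (>= 1).
    min_share: fraction of pages a line must appear on to be dropped.
    """
    counts = Counter()
    for lines in pages:
        # A set counts each candidate line at most once per page.
        counts.update({_key(t) for t in lines[:edge] + lines[-edge:]})

    threshold = min_share * len(pages)
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        last = len(lines) - edge
        cleaned.append([
            t for i, t in enumerate(lines)
            # Only lines in the edge zones are eligible for removal.
            if not ((i < edge or i >= last) and _key(t) in repeated)
        ])
    return cleaned
```

Restricting the check to the first and last lines of each page is a deliberate choice here: a sentence that happens to repeat in the body text is left alone, while running titles and page numbers are dropped.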

13 comments

dmezzetti over 1 year ago
One additional library to add, if you're working with scientific papers: https://github.com/kermitt2/grobid. I use this with paperetl (https://github.com/neuml/paperetl).
dmezzetti over 1 year ago
Nice project! I've long used Tika for document parsing given its maturity and the wide number of formats supported. The XHTML output helps with chunking documents for RAG.

Here are a couple of examples:

- https://neuml.hashnode.dev/build-rag-pipelines-with-txtai
- https://neuml.hashnode.dev/extract-text-from-documents

Disclaimer: I'm the primary author of txtai (https://github.com/neuml/txtai).
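As a rough illustration of why paragraph-level XHTML helps with chunking, the sketch below collects `<p>` text and packs consecutive paragraphs into size-limited chunks. It uses only the Python standard library, not Tika or txtai; the names `ParagraphCollector` and `chunk_xhtml` and the `max_chars` parameter are hypothetical, and it assumes the input marks paragraphs with `<p>` elements.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None  # list of text pieces while inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            # Collapse internal whitespace and skip empty paragraphs.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._buf = None


def chunk_xhtml(xhtml, max_chars=200):
    """Pack whole paragraphs into chunks of at most max_chars where possible."""
    parser = ParagraphCollector()
    parser.feed(xhtml)
    parser.close()

    chunks, current = [], ""
    for para in parser.paragraphs:
        if current and len(current) + 1 + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current} {para}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Because chunks only ever break between paragraphs, a single paragraph longer than `max_chars` becomes its own oversized chunk rather than being split mid-sentence.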
epaga over 1 year ago
This looks like it could be very helpful. The company I work for has a PDF comparison tool called "PDFC" which can read PDFs and runs comparisons of semantic differences. https://www.inetsoftware.de/products/pdf-content-comparer

Parsing PDFs can be quite the headache because the format is so complex. We support most of these features already but there are always so many edge cases that additional angles can be very helpful.
lmeyerov over 1 year ago
Tesseract OCR fallback sounds great!

There are now a lot of file loaders for RAG (langchain, LLMindex, unstructured, ...). Are there any reasons, like a leading benchmark score, to prefer this one?
mistrial9 over 1 year ago
Great effort and very interesting. However, I go to GitHub and I see "This organization has no public members". I do not know who you are at all, or what else might be part of this without disclosure.

Overall, I believe there has to be some middle ground for identification and trust building over time, between "hidden group with no names on $CORP secure site" and other traditional means of introduction and trust building.

Thanks for posting this interesting and relevant work.
asukla over 1 year ago
Thanks for the post. Please use this server with the llmsherpa LayoutPDFReader to get optimal chunks for your LLM/RAG project: https://github.com/nlmatics/llmsherpa. See examples and notebook in the repo.
firtoz over 1 year ago
Thank you for sharing. Are there some example input/output pairs somewhere?
huqedato over 1 year ago
I tried to parse a few hundred PDFs with it. The results are pretty decent. If this was developed in Julia, it would be ten times faster (at least).
guidedlight over 1 year ago
How does this differ from Azure Document Intelligence, or are they effectively the same thing?
jvdvegt over 1 year ago
Do you have any examples? There doesn't seem to be a single PDF file in the repo.
xfalcox over 1 year ago
We've been looking for something exactly like this, thanks for sharing!
ilaksh over 1 year ago
How does this compare to PaddleOCR?

Looks like Apache 2 license which is nice.
genewitch over 1 year ago
"Retrieval Augmented Generation"