TechEcho

The PDF parser is a rule-based parser that uses text coordinates (bounding boxes), graphics, and font data. It works off the text layer and also offers an OCR option that is applied automatically when a PDF contains scanned pages. The OCR feature is based on a modified version of Tika, which uses Tesseract underneath.

The PDF parser offers the following features:

* Sections and subsections along with their levels.
* Paragraphs (combines lines).
* Links between sections and paragraphs.
* Tables along with the section the tables are found in.
* Lists and nested lists.
* Joining of content spread across pages.
* Removal of repeating headers and footers.
* Watermark removal.
* OCR with bounding boxes.
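To make "removal of repeating headers and footers" concrete, here is a minimal Python sketch of one rule-based way to do it. It is not the project's implementation. The function names, the `min_fraction` and `edge_lines` thresholds, and the digit normalization are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def normalize(line):
    # Collapse digits so "Page 3" and "Page 4" count as the same footer.
    return re.sub(r"\d+", "#", line.strip())


def edge_candidates(lines, edge_lines):
    # Only the first and last few lines of a page can be headers or footers.
    edges = lines[:edge_lines] + lines[-edge_lines:]
    return {normalize(line) for line in edges if line.strip()}


def strip_repeated_lines(pages, min_fraction=0.6, edge_lines=2):
    """Drop edge lines that recur on most pages.

    pages: list of pages, each page a list of text lines.
    min_fraction and edge_lines are hypothetical thresholds.
    """
    counts = Counter()
    for lines in pages:
        counts.update(edge_candidates(lines, edge_lines))

    # A line must appear on at least two pages to count as repeating.
    threshold = max(2, min_fraction * len(pages))
    repeated = {text for text, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge_lines or i >= len(lines) - edge_lines
            if at_edge and normalize(line) in repeated:
                continue
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

A real parser would likely also compare line positions and fonts, since the post says bounding boxes and font data are available. This sketch uses text only.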

13 comments

dmezzetti over 1 year ago

One additional library to add, if you're working with scientific papers: https://github.com/kermitt2/grobid. I use this with paperetl (https://github.com/neuml/paperetl).

dmezzetti over 1 year ago

Nice project! I've long used Tika for document parsing given its maturity and the wide range of formats it supports. The XHTML output helps with chunking documents for RAG.

Here are a couple of examples:

- https://neuml.hashnode.dev/build-rag-pipelines-with-txtai
- https://neuml.hashnode.dev/extract-text-from-documents

Disclaimer: I'm the primary author of txtai (https://github.com/neuml/txtai).
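As a rough illustration of how XHTML output can help with chunking, the Python sketch below groups paragraph text under the nearest preceding heading. It assumes well-formed XHTML that contains `h1` to `h6` and `p` elements, which may not hold for every document type. `chunk_xhtml` is a hypothetical helper, not part of Tika or txtai.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def local_name(tag):
    # Strip an XML namespace such as "{http://www.w3.org/1999/xhtml}p".
    return tag.rsplit("}", 1)[-1]


def element_text(elem):
    # Join all nested text and collapse runs of whitespace.
    return " ".join("".join(elem.itertext()).split())


def chunk_xhtml(xhtml):
    """Return a list of (heading, text) chunks from an XHTML string.

    Paragraphs before the first heading get a heading of None.
    Headings with no paragraphs under them produce no chunk.
    """
    root = ET.fromstring(xhtml)
    chunks = []
    heading, paragraphs = None, []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name in HEADINGS:
            if paragraphs:
                chunks.append((heading, " ".join(paragraphs)))
            heading, paragraphs = element_text(elem), []
        elif name == "p":
            text = element_text(elem)
            if text:
                paragraphs.append(text)
    if paragraphs:
        chunks.append((heading, " ".join(paragraphs)))
    return chunks
```

Each chunk keeps its heading, so the section title can be included alongside the text when the chunk is passed to an LLM.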

epaga over 1 year ago

This looks like it could be very helpful. The company I work for has a PDF comparison tool called "PDFC", which reads PDFs and compares them for semantic differences: https://www.inetsoftware.de/products/pdf-content-comparer

Parsing PDFs can be quite the headache because the format is so complex. We support most of these features already, but there are always so many edge cases that additional angles can be very helpful.

lmeyerov over 1 year ago

Tesseract OCR fallback sounds great!

There are now a lot of file loaders for RAG (LangChain, LlamaIndex, Unstructured, ...). Are there any reasons, like a leading benchmark score, to prefer this one?

mistrial9 over 1 year ago

Great effort and very interesting. However, I go to GitHub and I see "This organization has no public members". I do not know who you are at all, or what else might be part of this without disclosure.

Overall, I believe there has to be some middle ground for identification and trust building over time, between "hidden group with no names on $CORP secure site" and other traditional means of introduction and trust building.

Thanks for posting this interesting and relevant work.

asukla over 1 year ago

Thanks for the post. Please use this server with the llmsherpa LayoutPDFReader to get optimal chunks for your LLM/RAG project: https://github.com/nlmatics/llmsherpa. See the examples and notebook in the repo.

firtoz over 1 year ago

Thank you for sharing. Are there some example input/output pairs somewhere?

huqedato over 1 year ago

I tried parsing a few hundred PDFs with it. The results are pretty decent. If this were developed in Julia, it would be ten times faster (at least).

guidedlight over 1 year ago

How does this differ from Azure Document Intelligence, or are they effectively the same thing?

jvdvegt over 1 year ago

Do you have any examples? There doesn't seem to be a single PDF file in the repo.

xfalcox over 1 year ago

We've been looking for something exactly like this, thanks for sharing!

ilaksh over 1 year ago

How does this compare to PaddleOCR? Looks like it has an Apache 2 license, which is nice.

genewitch over 1 year ago

"Retrieval Augmented Generation"

Show HN: Open-source Rule-based PDF parser for RAG
