As the title says, I have many PDFs -- mostly scans via ScanSnap, but also non-scans. These are sensitive in nature, e.g. bills, documents, etc. I would like a local-first AI solution that lets me ask things like "show me all tax documents for August 2023" or "show my home title". Ideally it is Mac software that can access iCloud too, since that's where I store it all. I would prefer not to do any tagging. I would like to optimize for recall over precision, so false positives in the search results are OK. What are modern approaches to do this, without hacking one up on my own?
The RAG CLI from LlamaIndex lets you do this 100% locally when used with Ollama or llama.cpp instead of OpenAI.

https://docs.llamaindex.ai/en/stable/getting_started/starter_tools/rag_cli/
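If you'd rather script it than use the CLI, here's a minimal sketch of the same setup, assuming llama-index is installed with its Ollama and HuggingFace embedding extras (package paths move around between versions, and the folder path is a placeholder):

    # Local RAG over a folder of PDFs: Ollama for generation, a local embedding model for retrieval.
    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
    from llama_index.llms.ollama import Ollama
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding

    Settings.llm = Ollama(model="llama3", request_timeout=120.0)
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

    docs = SimpleDirectoryReader("/Users/me/Documents/scans", required_exts=[".pdf"]).load_data()
    index = VectorStoreIndex.from_documents(docs)
    print(index.as_query_engine().query("show me all tax documents for August 2023"))

Nothing leaves the machine as long as both the LLM and the embedding model are local.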
Paperless supports OCR + full-text indexing: https://docs.paperless-ngx.com/

As far as AI goes, not sure.
I am a medical student with thousands and thousands of PDFs and was unsatisfied with RAG tools, so I made my own. It can consume basically any type of content (PDF, EPUB, YouTube playlist, Anki database, MP3, you name it) and does a multi-step RAG: first retrieving with embeddings, then filtering with a smaller LLM, then answering by feeding each remaining document to the strong LLM, then combining those answers (there's a rough sketch of that flow after the link below).

It supports virtually all LLMs and embeddings, including local LLMs and local embeddings.
It scales surprisingly well and I have tons of improvements to come, when I have some free time or procrastinate.
Don't hesitate to ask for features!

Here's the link: https://github.com/thiswillbeyourgithub/DocToolsLLM/
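If anyone is curious what that multi-step flow looks like, here's a rough standalone sketch of the idea (simplified, hypothetical helpers, not my tool's actual code):

    # Multi-step RAG: retrieve with embeddings, filter with a small LLM,
    # answer per document with a strong LLM, then merge the partial answers.
    def multi_step_rag(query, index, small_llm, strong_llm, top_k=50):
        candidates = index.similarity_search(query, k=top_k)            # step 1: embedding retrieval
        relevant = [d for d in candidates                               # step 2: cheap relevance filter
                    if small_llm(f"Relevant to '{query}'? yes/no:\n{d}").strip().lower().startswith("yes")]
        partials = [strong_llm(f"Answer '{query}' using only this document:\n{d}")
                    for d in relevant]                                  # step 3: per-document answers
        return strong_llm("Combine these partial answers into one:\n" + "\n---\n".join(partials))  # step 4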
The primary challenge is not just about harnessing AI for search; it's about preparing complex documents of various formats, structures, designs, scans, multi-layout tables, and even poorly captured images for LLM consumption. This is a crucial issue.

There is a 20-minute read on why parsing PDFs is hell: https://unstract.com/blog/pdf-hell-and-practical-rag-applications/

To parse PDFs for RAG applications, you'll need tools like LLMWhisperer[1] or unstructured.io[2] (a quick sketch with unstructured follows the footnote links below).

Now back to your problem:

This solution might be overkill for your requirement, but you can try the following:

To set things up quickly, try Unstract[3], an open-source document processing tool. You can set this up and bring your own LLM models; it also supports local models.
It has a GUI to write prompts to get insights from your documents.[4]

[1] https://unstract.com/llmwhisperer/
[2] https://unstructured.io/
[3] https://github.com/Zipstack/unstract
[4] https://github.com/Zipstack/unstract/blob/main/docs/assets/prompt_studio.png
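As a quick example of the parsing step, a minimal sketch with the open-source unstructured library (assuming its PDF extras are installed; LLMWhisperer and Unstract ship their own clients):

    # Partition a PDF into typed elements (titles, paragraphs, tables...) for downstream RAG.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(filename="home_title_scan.pdf")  # strategy="hi_res" adds layout detection/OCR
    text = "\n".join(el.text for el in elements if el.text)
    print(text[:500])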
For macOS, there's this: https://pdfsearch.app/

Without AI, but searching the PDF content, I use Recoll (https://www.recoll.org/) or ripgrep-all (https://github.com/phiresky/ripgrep-all).
You have to find a good OCR tool that you can run locally on your hardware. RAG depends on your doc processing pipeline.

It's not local, but the Azure Document Intelligence OCR service has a number of prebuilt models. The "prebuilt-read" model is $1.50/1k pages. Once you OCR your docs, you'll have a JSON of all the text AND you get breakdowns by page/word/paragraph/tables/figures, all with bounding boxes.

Forget the Lang/Llama/Chain theory. You can do it all in vanilla Python.
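A rough sketch of that call in vanilla Python (endpoint, key, and file name are placeholders; the SDK has been renamed across versions, so check the current package):

    # OCR a scanned PDF with the prebuilt-read model and dump the recognized words per page.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient("https://<your-resource>.cognitiveservices.azure.com/",
                                    AzureKeyCredential("<key>"))
    with open("scan.pdf", "rb") as f:
        result = client.begin_analyze_document("prebuilt-read", document=f).result()

    for page in result.pages:
        print(" ".join(w.content for w in page.words))  # bounding boxes live in w.polygon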
rga, aka ripgrep-all, is my go-to for this. I suppose grep is a form of AI -- or, at least, an advanced intelligence that's wiser than it looks. ;)

https://github.com/phiresky/ripgrep-all
If you haven’t given some serious thought to getting rid of most of the documents then consider it. There is very little need to keep most routine documents for more than a few years. If you think you need your electric bill for March 2006 at your fingertips, why?
You can use Microlink to turn a PDF into HTML, and combine it with another service for processing the text data.

Here's an example turning an arXiv paper into real text:

https://api.microlink.io/?data.html.selector=html&embed=html&meta=false&url=https://arxiv.org/pdf/2104.12871

It looks like a PDF, but if you open devtools you can see it's actually a very precise HTML representation.
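A tiny Python sketch of calling it (assuming, as the example link suggests, that embed=html makes the endpoint return the rendered HTML directly; plans/rate limits apply):

    # Fetch the HTML rendering of a remote PDF via Microlink, then strip tags however you like.
    import requests

    url = ("https://api.microlink.io/?data.html.selector=html&embed=html"
           "&meta=false&url=https://arxiv.org/pdf/2104.12871")
    html = requests.get(url, timeout=60).text
    print(html[:500])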
If you're looking for something local, we develop an app for macOS and Windows that lets you search and talk to local files and data from cloud apps: https://curiosity.ai
For the AI features, you can use OpenAI or local models (the app uses llama.cpp in the background, it ships with llama3 and a few other models, and we're soon going to let you use any .gguf model)
Like many others have suggested, local indexing is what I use for this, although some more natural interface may be better for structured search and querying.

What I haven't seen suggested, though, is the built-in Spotlight. Press Cmd+Space, type some unique words that might appear in the document, and Spotlight will search for it. This also works surprisingly well for non-OCRed images of text, anything inside a zip file, an email, etc.
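The same Spotlight index is also scriptable via mdfind if you want to batch queries (macOS only; the folder path below is a placeholder):

    # Query Spotlight from a script and print the first matches.
    import subprocess

    hits = subprocess.run(
        ["mdfind", "-onlyin", "/Users/me/Documents", "tax 2023"],
        capture_output=True, text=True,
    ).stdout.splitlines()
    print("\n".join(hits[:20]))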
PrivateGPT is a great starting point for using a local model and RAG.
text-generation-webui (oobabooga) with superbooga V2 is very nice and more customizable.

I've used both for sensitive internal SOPs, and both work quite well. PrivateGPT excels at ingesting many separate documents; the other excels at customization. Both are totally offline, and can use mostly whatever models you want.
This could be humor or a real hack.

Get a Copilot PC with Recall enabled and quickly scan through the documents by opening them in Adobe Acrobat Reader. Voilà! You will have an SQLite DB that has your index. A few days later, Adobe could have your data in their LLM.
Try https://github.com/phiresky/ripgrep-all before going down the rabbit hole of AI and advanced indexers. It's quick to set up and undo if that's not what you want, but I'm pretty sure you'll be surprised how far this can get you.
If you want to run locally you can look into this: https://github.com/PaddlePaddle/PaddleOCR

https://andrejusb.blogspot.com/2024/03/optimizing-receipt-processing-with.html

But I suggest that you just skip that and use GPT-4o. They aren't actually going to steal your data.

Sort through it ahead of time to find anything with a credit card number or the like.

Or you could look into InternVL.

Or use a combination of PaddleOCR first and then a strong LLM via API, like GPT-4o or Llama 3 70B via together.ai.

If you truly must do it locally, then if you have two 3090s or 4090s it might work out. Otherwise the LLMs may not be smart enough to give good results.

Leaving out the details of your hardware makes it impossible to give good advice about running locally. Other than: it's not really necessary.
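For reference, a minimal PaddleOCR sketch (models download on first run, and the result format has shifted a bit between releases):

    # Local OCR of a single image; each result entry is [bounding box, (text, confidence)].
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    result = ocr.ocr("receipt.png")
    for box, (text, conf) in result[0]:
        print(text)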
I looked into this for sensitive material recently. In the end I got a purpose-built local system built and am having it remotely maintained. Cost: around 5k a year. I used http://www.skunkwerx.ai, who are US based.

The result is a huge step up from 'full text search' solutions, for my use case. I can have conversations with decades of documents, and it's incredibly helpful. The support scheme keeps my original documents unconnected from the machine, which I own, while updates are done over a remote link. It's great, and I feel safe.

Things change so fast in this space that there did not seem to be a cheap, stable, local alternative. I honestly doubt one is coming. This is not a one-size-fits-all problem.
Google Drive. It doesn't fulfill the "local" criterion, but it works for us (a small engineering firm). We synchronize our local file server with GD nightly and use it only for searching. Google is just good when it comes to search.
Thank you all for the comments. Got a lot of good input and ways to think through the tried-and-true tools (enjoying ripgrep-all + fzf) plus the standard AI/RAG-style tools. I do think there is room for a bridge or an integrated way to pipe similarity/embedding search into the ripgreps of the world. Maybe something close to fzf's piping model. Will explore if I have some time.
Use Recoll on Linux or File Locator Lite on Windows to do RegEx searches. Design the RegEx searches with GPT or llama running locally (or write them yourself).
> Ask HN: I have many PDFs – what is the best local way to leverage AI for search?

Adobe Reader can search all PDFs in a directory. They hide this function, though.
Honestly?

ocrmypdf + ripgrep-all, or recoll (a GUI+CLI Xapian wrapper) if you prefer an indexed version. For mere full-text search, currently nothing gives better results. Semantic search is still not there; Paperless-ngx, TagSpaces and so on demand way too much time for adding just a single document to be useful at a certain scale.

My own personal version is org-mode: I keep all my stuff org-attached, so instead of searching the PDFs I search my notes linking them -- a kind of metadata-rich, taggable, quick full-text search. Even though org-ql is there, I almost never use it, just org-roam-node-find and counsel-rg on notes. Once set up, this allows quick enough manual and variously automated archiving; doing it for a large home directory is very long and tedious manual work. For me it's worth doing since I keep adding documents and using them, but it took more than a year to be "almost done enough" and it's still unfinished after 4 years.
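A small sketch of the ocrmypdf step, so the scans actually get a text layer for rga/recoll to chew on (folder paths are placeholders; skip_text leaves pages that already contain text alone):

    # Batch-OCR scanned PDFs into a sidecar folder with a searchable text layer.
    from pathlib import Path
    import ocrmypdf

    out = Path("/Users/me/Documents/scans_ocr")
    out.mkdir(exist_ok=True)
    for pdf in Path("/Users/me/Documents/scans").glob("*.pdf"):
        ocrmypdf.ocr(pdf, out / pdf.name, skip_text=True)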
On macOS, use HoudahSpot. It's awesome. Not AI, but as others have said, you likely want plain text search, not "AI" or a chatbot, for something like this.

If you're having trouble thinking of search terms to plug into HoudahSpot (or grep etc.) then I suppose you could ask a chatbot to assist your brainstorming, and then plug those terms into HoudahSpot/grep/etc.
A cheap but full-featured solution for batch AI processing of PDF documents on your local machine is the Aspose.PDF ChatGPT plugin:

https://products.aspose.org/pdf/net/chat-gpt/
The best tool I found for a similar goal was Devonthink. I've been using it for many years and am quite happy with it.

There is no AI or any other modern fad, but full-text search (including OCR for image files inside PDFs) works great.
Devonthink would do this with a tiny model to translate your natural-language search prompts into its syntax and your folder/tag tree.

If you're okay with some false positives, Devonthink would work as is, actually.
Tangentially related, but you can try https://macro.com/ for reading your PDFs.
Check out my app "Chofane" -- this is something that does it: local batch OCR scanning for PDFs and PNG files. I am just launching it. You can export results to JSON and CSV, and do some text-based search on the results.
https://chofane-landing.pages.dev/
OCR and pattern matching on text are computationally cheap and incredibly easy to do. For example, tax documents often bear the name of your government's tax authority, which presumably you are familiar with and can search for. They also tend to have years on them.
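A sketch of how cheap that pattern matching can be once the text is extracted (the authority name and year patterns here are just examples; adjust for your country and the documents you actually have):

    # Flag plausible tax documents for a given year in a folder of OCR'd text files.
    import re
    from pathlib import Path

    tax_words = re.compile(r"(internal revenue service|form 1040|tax year)", re.I)
    year = re.compile(r"\b2023\b")
    for txt in Path("ocr_output").glob("*.txt"):
        body = txt.read_text(errors="ignore")
        if tax_words.search(body) and year.search(body):
            print(txt)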