Ask HN: Way to extract relevant parts from a PDF based on a question?

1 point by madhatter999 about 1 year ago

Dear HN,

I am trying to do semantic search over a corpus of PDF documents, with a question as input. My goal is to find the parts of the PDFs that best answer the question. I am interested in concepts, frameworks, and methodologies that will help with this task. If you have any pointers, I would greatly appreciate them!

3 comments

liampulles about 1 year ago

This is a key use case for text embeddings. Essentially, embedding converts sentences or paragraphs into vectors, such that the closeness of two vectors represents the semantic similarity of the texts.

So you can convert all the paragraphs in your document into vectors, convert your question into a vector, and then find, e.g., the 10 closest vectors, or all that fall under a certain maximum distance, etc.

You can store the embeddings in a vector database to search across multiple documents.
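A minimal sketch of that embed-and-rank loop, in Python. The `embed` function here is a toy bag-of-words stand-in (an assumption for illustration, so the example runs without any model); in practice you would swap in a real sentence-embedding model, since word counts only capture shared vocabulary, not meaning. The sample paragraphs, `search`, `top_k`, and `min_similarity` are likewise hypothetical names, and the threshold is expressed as a minimum cosine similarity rather than a maximum distance.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy placeholder for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(question, paragraphs, top_k=10, min_similarity=0.0):
    """Return up to top_k (score, paragraph) pairs, most similar first."""
    q = embed(question)
    scored = [(cosine_similarity(q, embed(p)), p) for p in paragraphs]
    # Keep only paragraphs strictly above the similarity threshold.
    scored = [(s, p) for s, p in scored if s > min_similarity]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical paragraphs, as if already extracted from a PDF.
paragraphs = [
    "The warranty covers battery replacement for two years.",
    "Shipping takes five business days within Europe.",
    "To reset the device, hold the power button for ten seconds.",
]

results = search("How long is the battery warranty?", paragraphs, top_k=2)
for score, paragraph in results:
    print(f"{score:.3f}  {paragraph}")
```

A vector database does the same ranking, but over precomputed embeddings and with an index, so it does not have to compare the question against every paragraph.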


verdverm about 1 year ago

LlamaIndex is my tool of choice right now.

https://docs.llamaindex.ai/en/stable/

https://docs.llamaindex.ai/en/stable/examples/citation/pdf_page_reference/?h=pdf

I'm using it with Qdrant and can also get the text sections and locations that are tied to the answer and citation.


anoni2 about 1 year ago

Try: https://notebooklm.google/
