Ask HN: Way to extract relevant parts from a PDF based on a question?

1 point by madhatter999 about 1 year ago
Dear HN,

I am trying to do semantic search over a corpus of PDF documents, using a question as input. My goal is to find the parts of the PDFs that best answer the question. I am interested in concepts, frameworks, and methodologies that would help with this task. If you have any pointers, I would greatly appreciate it!

3 comments

liampulles about 1 year ago
This is a key use case for text embeddings. Essentially, you convert sentences or paragraphs to vectors, where the closeness of two vectors represents their semantic similarity.

So you can convert all the paragraphs in your document into vectors, convert your question into a vector, and then find the closest vectors (e.g. the 10 nearest, or all that fall under a certain maximum distance).

You can store the embeddings in a vector database to search across multiple documents.
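
To make the steps above concrete, here is a minimal sketch in Python using only the standard library. It assumes the paragraphs have already been extracted from the PDFs. The `embed` function is a toy bag-of-words stand-in, not a real embedding model, so it only matches on shared words; in practice you would swap in a sentence-embedding model. The names `embed`, `cosine_similarity`, `search`, and the sample paragraphs are all hypothetical, chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    Assumption: a real system would return a dense vector from a
    sentence-embedding model here instead of word counts.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(question: str, paragraphs: list[str], k: int = 10,
           min_similarity: float = 0.0) -> list[tuple[float, str]]:
    """Return up to k (score, paragraph) pairs scoring above min_similarity, best first."""
    q = embed(question)
    scored = [(cosine_similarity(q, embed(p)), p) for p in paragraphs]
    scored = [(s, p) for s, p in scored if s > min_similarity]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical paragraphs, standing in for text extracted from a PDF.
paragraphs = [
    "The warranty covers manufacturing defects for two years.",
    "Shipping takes five business days within the country.",
    "To claim the warranty, contact support with your receipt.",
]
results = search("How long is the warranty?", paragraphs, k=2)
for score, paragraph in results:
    print(f"{score:.3f}  {paragraph}")
```

The `k` and `min_similarity` parameters correspond to the two cut-offs mentioned above: keep the nearest few, or keep everything within a similarity threshold. A vector database plays the role of the list scan in `search` once the corpus grows beyond what fits comfortably in memory.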
verdverm about 1 year ago
LlamaIndex is my tool of choice right now.

https://docs.llamaindex.ai/en/stable/

https://docs.llamaindex.ai/en/stable/examples/citation/pdf_page_reference/?h=pdf

I'm using it with Qdrant and can get the text sections and locations that are tied to the answer, and the citation as well.
anoni2 about 1 year ago
Try: https://notebooklm.google/