Show HN: Index and search *all* your documents

21 points by spapas82 10 months ago
Hey HN! I've built a simple tool to index and search your documents. It uses two great open source libraries: Apache Tika (for extracting content from docs) and Apache Lucene (for searching). It's built with Kotlin, using Ktor as the web framework.

You can index all kinds of files (e.g. doc, docx, xls, ppt, pdf, txt, html, even OCR PDFs) and then search them using very advanced queries like "always contain X", "never contain X", "X near Y", wildcard search, proper stemming support, etc.

We're using it at my work, where we have hundreds of thousands of doc/docx/pdf files, and it works flawlessly!
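To make the query types mentioned above concrete, here is a minimal, self-contained Python sketch of a toy positional inverted index. It is not the tool's code and not how Lucene is implemented; the `TinyIndex` class, its method names, and the sample documents are all hypothetical, and the "near" check uses a simplified distance rule rather than Lucene's slop semantics. For reference, Lucene's classic query syntax expresses the same ideas as `+X` (required), `-X` (prohibited), `"X Y"~5` (proximity), and `X*` (wildcard).

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase


class TinyIndex:
    """Toy positional inverted index (illustrative only, not Lucene)."""

    def __init__(self):
        # term -> {doc_id -> [token positions]}
        self.postings = defaultdict(lambda: defaultdict(list))
        self.doc_ids = set()

    def add(self, doc_id, text):
        """Tokenize on word characters, lowercase, and record positions."""
        self.doc_ids.add(doc_id)
        for pos, term in enumerate(re.findall(r"\w+", text.lower())):
            self.postings[term][doc_id].append(pos)

    def docs_with(self, term):
        """Return the set of documents containing the exact term."""
        return set(self.postings.get(term.lower(), {}))

    def wildcard(self, pattern):
        """Match indexed terms against a shell-style pattern such as 'extract*'."""
        hits = set()
        for term in self.postings:
            if fnmatchcase(term, pattern.lower()):
                hits |= set(self.postings[term])
        return hits

    def near(self, a, b, slop):
        """Documents where a and b occur within `slop` token positions
        (a simplification of proximity search)."""
        a, b = a.lower(), b.lower()
        hits = set()
        for doc_id in self.docs_with(a) & self.docs_with(b):
            pos_a = self.postings[a][doc_id]
            pos_b = self.postings[b][doc_id]
            if any(abs(i - j) <= slop for i in pos_a for j in pos_b):
                hits.add(doc_id)
        return hits

    def search(self, must=(), must_not=()):
        """'Always contain' every term in must, 'never contain' any in must_not."""
        hits = set(self.doc_ids)
        for term in must:
            hits &= self.docs_with(term)
        for term in must_not:
            hits -= self.docs_with(term)
        return hits


# Hypothetical sample documents
index = TinyIndex()
index.add("a.txt", "Apache Lucene powers full text search")
index.add("b.txt", "Apache Tika extracts text from PDF files")
index.add("c.txt", "Search engines index extracted text")

print(sorted(index.search(must=["apache"], must_not=["tika"])))  # ['a.txt']
print(sorted(index.near("apache", "search", 5)))                 # ['a.txt']
print(sorted(index.wildcard("extract*")))                        # ['b.txt', 'c.txt']
```

A real setup would hand text extraction to Tika and the indexing and query parsing to Lucene; the sketch only shows what each query type is asking of the index.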

3 comments

unstatusthequo 9 months ago
Yes, this is great. I looked into stitching these together but always figured it would be a huge undertaking. Consider looking at TensorFlow for OCR, which should be much better and maybe faster.
namanyayg 9 months ago
nice work on this! i've been looking for something like this to manage my own docs.

one thing that caught my eye was the mention of 'proper stemming support' - can you elaborate on how you're handling stemming? are you using a specific library or rolling your own implementation? also, have you considered adding any sort of faceting/search filtering to the results?
compressedgas 9 months ago
Do the search results have document page numbers?