Show HN: Index and search *all* your documents

21 points by spapas82, 9 months ago

Hey HN! I've built a simple tool to index and search your documents. It uses two great open source libraries: Apache Tika (for extracting content from docs) and Apache Lucene (for searching). It's built in Kotlin, with Ktor as the web framework.

You can index all kinds of files (e.g. doc, docx, xls, ppt, pdf, txt, html, even OCR PDFs) and then search them using very advanced queries like "always contain X", "never contain X", "X near Y", wildcard search, proper stemming support, etc.

We're using it at my work, where we have hundreds of thousands of doc/docx/pdf files, and it works flawlessly!
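To make the query types above concrete, here is a minimal sketch of what "always contain X", "never contain X", "X near Y" and wildcard search mean over a positional inverted index. This is a toy in plain Python, not the tool's code and not Lucene's implementation: the class `TinyIndex`, its method names and the sample documents are all hypothetical, there is no stemming, and "near" is simplified to "positions at most `slop` tokens apart". In Lucene's classic query syntax the same ideas are usually written as `+X`, `-X`, `"X Y"~N` and `X*`.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical toy positional index; illustration only."""

    def __init__(self):
        # term -> {doc_id: [token positions]}
        self.postings = defaultdict(dict)
        self.doc_ids = set()

    def add(self, doc_id, text):
        self.doc_ids.add(doc_id)
        for pos, term in enumerate(tokenize(text)):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def docs_with(self, term):
        return set(self.postings.get(term.lower(), {}))

    def search(self, must=(), must_not=()):
        """'Always contain' every term in must, 'never contain' any in must_not."""
        hits = set(self.doc_ids)
        for term in must:
            hits &= self.docs_with(term)
        for term in must_not:
            hits -= self.docs_with(term)
        return hits

    def near(self, a, b, slop):
        """Docs where a and b occur at most `slop` token positions apart."""
        pa = self.postings.get(a.lower(), {})
        pb = self.postings.get(b.lower(), {})
        return {
            doc
            for doc in pa.keys() & pb.keys()
            if any(abs(i - j) <= slop for i in pa[doc] for j in pb[doc])
        }

    def wildcard(self, prefix):
        """Trailing-wildcard search: docs with any term starting with prefix."""
        prefix = prefix.lower()
        hits = set()
        for term, docs in self.postings.items():
            if term.startswith(prefix):
                hits |= set(docs)
        return hits


index = TinyIndex()
index.add("a.txt", "Apache Tika extracts text from PDF files")
index.add("b.txt", "Apache Lucene searches the extracted text")
index.add("c.txt", "Kotlin and Ktor serve the web interface")

apache_without_tika = index.search(must=["apache"], must_not=["tika"])
apache_near_text = index.near("apache", "text", slop=3)
extract_star = index.wildcard("extract")
```

With these three sample documents, the first query matches only `b.txt`, the proximity query matches only `a.txt`, and the wildcard matches both `a.txt` and `b.txt`. A real engine adds analysis (stemming, stop words), scoring and on-disk storage on top of the same basic structure.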

3 comments

unstatusthequo, 9 months ago
Yes this is great. I looked into stitching these together but always figured it would be a huge undertaking. Consider looking at TensorFlow for OCR which should be much better and maybe faster.
namanyayg, 9 months ago
nice work on this! i've been looking for something like this to manage my own docs.

one thing that caught my eye was the mention of 'proper stemming support' - can you elaborate on how you're handling stemming? are you using a specific library or rolling your own implementation? also, have you considered adding any sort of faceting/search filtering to the results?
compressedgas, 9 months ago
Do the search results have document page numbers?