科技回声

Hey HN! I've built a simple tool to index and search your documents. It uses two great open source libraries: Apache Tika (for extracting content from docs) and Apache Lucene (for searching). It's built with Kotlin, using Ktor as the web framework.

You can index all kinds of files (e.g. doc, docx, xls, ppt, pdf, txt, html, even OCR'd PDFs) and then search them using advanced queries like "always contain X", "never contain X", "X near Y", wildcard search, proper stemming support, etc.

We're using it at my work, where we have hundreds of thousands of doc/docx/pdf files, and it works flawlessly!
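For readers unfamiliar with these query types, here is a toy sketch of what "always contain X", "never contain X", "X near Y" and wildcard search mean. It is not the tool's code and does not use Lucene: the sample documents, the function names and the simple position-distance rule for "near" are all assumptions made for illustration. In Lucene's classic query parser syntax these would typically be written as `+X`, `-X`, `"X Y"~5` and `X*`; whether the tool exposes that syntax directly is not stated in the post.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical sample documents. In the real tool the text would be
# extracted from files by Apache Tika; here it is hard-coded.
DOCS = {
    "report.pdf": "quarterly sales report for the northern region",
    "memo.docx": "sales figures were strong but the report is delayed",
    "notes.txt": "meeting notes about regional hiring plans",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (no stemming here)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each document name to a {token: [positions]} dictionary."""
    index = {}
    for name, text in docs.items():
        positions = {}
        for pos, tok in enumerate(tokenize(text)):
            positions.setdefault(tok, []).append(pos)
        index[name] = positions
    return index


def matching_positions(positions, pattern):
    """Positions of all tokens matching a plain term or a wildcard like 'region*'."""
    return sorted(
        p
        for tok, tok_positions in positions.items()
        if fnmatchcase(tok, pattern)
        for p in tok_positions
    )


def search(index, must=(), must_not=(), near=None):
    """Return sorted names of documents satisfying every clause.

    must     -- terms that have to occur ("always contain X")
    must_not -- terms that may not occur ("never contain X")
    near     -- optional (term_a, term_b, max_distance) tuple ("X near Y"),
                using a simple token-distance rule, not Lucene's slop scoring
    """
    hits = []
    for name, positions in index.items():
        if not all(matching_positions(positions, t) for t in must):
            continue
        if any(matching_positions(positions, t) for t in must_not):
            continue
        if near is not None:
            term_a, term_b, max_distance = near
            pos_a = matching_positions(positions, term_a)
            pos_b = matching_positions(positions, term_b)
            if not any(abs(a - b) <= max_distance for a in pos_a for b in pos_b):
                continue
        hits.append(name)
    return sorted(hits)


INDEX = build_index(DOCS)
```

For example, `search(INDEX, must=["sales"], must_not=["delayed"])` keeps only documents that mention "sales" but not "delayed", and `search(INDEX, must=["region*"])` matches both "region" and "regional". A real index would also apply stemming during tokenization, which this sketch leaves out.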

Yes, this is great. I looked into stitching these together but always figured it would be a huge undertaking. Consider looking at TensorFlow for OCR, which should be much better and maybe faster.

nice work on this! i've been looking for something like this to manage my own docs.

one thing that caught my eye was the mention of 'proper stemming support' - can you elaborate on how you're handling stemming? are you using a specific library or rolling your own implementation? also, have you considered adding any sort of faceting/search filtering to the results?

Do the search results have document page numbers?


Show HN: Index and search all your documents

3 comments
