What is it, I mean, what's stopping you from making the next Google <i>search</i>?<p>Tf-Idf? No. Just kidding.
Information retrieval, text mining, ML/DL?! What is going on with this field?
Every other resource seems outdated. What is the state of the art?<p>Reading some of these posts:
https://boyter.org/2010/08/build-vector-space-search-engine-python/<p>https://www.dr-josiah.com/2010/07/building-search-engine-using-redis-and.html<p>https://stevenloria.com/tf-idf/<p>https://stories.algolia.com/a-search-engine-in-css-b5ec4e902e97
I’ve done this over the last year on a tiny scale for my own needs: gorillafind.com. It’s from scratch and just for government sites to sidestep some of the challenges (but so far it only has 50 sites). The cost per site is around $1/mo for crawling, indexing, converting file formats and then serving up results. It’s difficult but not impossible and very educational. If you’d like to hear more about doing it yourself and some of the challenges, feel free to email me with the contact info on the site. My system isn’t open source but I’m more than happy to chat about the research I’ve done and how you can make one.<p>I’d start off with not doing state of the art because it’s overkill for an “MVP”. And if you don’t need proper browser rendering of pages, there are open source crawlers out there like Nutch that might work. If you’re making one yourself, the outdated academic papers and presentations by search companies are a good resource as the basic ideas of crawling and indexing haven’t changed too much (even if ranking and other components have changed a lot). A search engine is really a set of related components and there are many examples out there to use as inspiration for your MVP.
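To make the "indexing" component a bit more concrete, here is a minimal sketch of an inverted index in Python. It is not how gorillafind.com works (that system isn't public); the function names `tokenize`, `build_index` and `search`, and the sample documents, are all made up for illustration. It assumes the crawler has already fetched pages and reduced them to plain text.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    # Boolean AND: return ids of documents containing every query term.
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data standing in for crawled pages.
docs = {
    1: "Passport renewal forms and fees",
    2: "Tax forms for small business",
    3: "National park fees and passes",
}
index = build_index(docs)
print(search(index, "forms"))       # {1, 2}
print(search(index, "forms fees"))  # {1}
```

This only answers "which documents match"; it says nothing about ordering, which is where the ranking component comes in.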
Have you read this post?<p><a href="https://danluu.com/sounds-easy/" rel="nofollow">https://danluu.com/sounds-easy/</a><p>It describes some of the various difficulties in building the next Google Search, much better than I could.
And to add, I am not looking for a "how to learn".<p>Nor is it a <i>site-wide search</i> question.<p>And not a business model either.<p>Not a "privacy first but no results" search.<p>Something that works.
I built a “search engine” but unfortunately my approach lacked any implicit ranking mechanism.<p>TLDR: my regex search engine needs result ranking to be more useful before I consider showing it to other humans.
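One simple way to bolt ranking onto regex matching is a TF-IDF-style score: count how often each pattern matches in a document, and give more weight to patterns that match in few documents. The sketch below is only an assumption about what could work, not the approach used in the engine described above; `rank_regex`, the weighting formula and the sample documents are hypothetical.

```python
import math
import re


def rank_regex(docs, patterns):
    # Count matches of every pattern in every document.
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    counts = {
        doc_id: [len(rx.findall(text)) for rx in compiled]
        for doc_id, text in docs.items()
    }

    n_docs = len(docs)
    scores = {}
    for i in range(len(compiled)):
        # Document frequency: how many documents this pattern matches at all.
        df = sum(1 for c in counts.values() if c[i] > 0)
        if df == 0:
            continue
        idf = math.log(1 + n_docs / df)
        for doc_id, c in counts.items():
            if c[i] > 0:
                # Dampened term frequency times inverse document frequency.
                tf = 1 + math.log(c[i])
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf

    # Highest score first; documents with no match are left out.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample data.
docs = {
    "a": "cat cat cat dog",
    "b": "cat",
    "c": "bird",
}
ranked = rank_regex(docs, [r"c[a-z]t"])
print([doc_id for doc_id, _ in ranked])  # ['a', 'b']
```

With a single pattern the IDF factor is the same for every document, so the order comes down to match count; the IDF weighting only starts to matter when a query is made of several patterns.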