Ask HN: How would you build a search engine in 2019?

35 points by throwaway13000 over 5 years ago

So, I was wondering how to build a competitor to Google. We have Common Crawl and the Internet Archive. Distributed systems are pretty well understood. How would one go about building a search engine in 2019?

Would you do what DuckDuckGo did, which is to use somebody else's index and ranking, or would you just build your index from Common Crawl? How are Ecosia and Startpage.com able to stay profitable without doing either?

Does that mean we can have many niche search engines? Can we crawl Common Crawl and build an index for less than $10K (what one individual can pay out of pocket)?

9 comments

bwb over 5 years ago

I think the why changes the approach...

I think my takeaway from the last 10 years is that a lot of the info on websites that was written by real people has disappeared, and you now have a lot of spammy blogs and heavily commercial approaches. And most of the real info has gone into Facebook groups, Quora, Reddit comments, Slack, Twitter, and so on.

The problem is that those are all closed-door ecosystems in a lot of ways, and the knowledge is hard to differentiate from the temporal messaging.

I think if I was going to approach this, I would build software that users run, or browser add-ons that let users tag and save information in some type of format, which then contributes to a searchable knowledge base.

For example, I am a member of several FB groups focused on specific expat groups for where I live. There are great pieces of wisdom and hard-to-find info in there. I'd love to be able to say, with a Chrome extension, "save this, and here is a little context" (or it would be great if it could work that out from the format).

Then try to figure out how to make that public and searchable.

rossdavidh over 5 years ago

So, if you want to make a competitor to Google, you would need to have your own index, but it would be prohibitively expensive to make one on Google's scale. So you would need to make a search engine that can simply skip most websites, because they are clearly not what your users are looking for. That way, you can "only" index the 1% or so of the internet that is in your niche. Some ideas:

- indie websites only (no news, no Medium, no ecommerce, etc.), for those who want to find individuals who still maintain their own website and say something interesting on it
- low-size websites only, for people with very low bandwidth; anything above a certain size (e.g. 1 MB) doesn't get indexed
- recipes (but there are some niche websites for this already)
- websites with no ads on them (but this may conflict with your business model, if you have one)
- websites focused on a certain geographic area (e.g. websites with information by, for, and about Texas, or Slovakia, or Buenos Aires)
- websites with no JavaScript on them (for people who want to be able to turn off JavaScript, but don't have a good way of finding out which websites they can still use to get a particular piece of info)
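A couple of the filters above (the size cap and the no-JavaScript rule) are easy to sketch as a crawl-time check. This is only a rough illustration: `Page`, `should_index`, and the 1,000,000-byte cutoff are made-up names and values, and looking for a `<script` tag is a crude stand-in for real JavaScript detection.

```python
import re
from dataclasses import dataclass

# Assumed cutoff, based on the rough "1 MB" example in the list above.
MAX_PAGE_BYTES = 1_000_000

# Crude heuristic: any <script ...> tag counts as "uses JavaScript".
SCRIPT_TAG = re.compile(r"<script\b", re.IGNORECASE)


@dataclass
class Page:
    """Hypothetical container for one crawled page."""
    url: str
    html: str


def should_index(page: Page, max_bytes: int = MAX_PAGE_BYTES) -> bool:
    """Return True if the page passes the low-size and no-JavaScript filters."""
    size = len(page.html.encode("utf-8"))
    if size > max_bytes:
        return False  # too heavy for low-bandwidth users
    if SCRIPT_TAG.search(page.html):
        return False  # page ships JavaScript
    return True
```

Running a check like this before parsing or storing anything is what keeps the index down to the small slice of the web that fits the niche.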

pouta over 5 years ago

I've been having this idea of crowdsourcing a better Google via a browser extension.

Every time I repeat a search query and end up finding the answer on the same website I visited before, I wish I had marked that link as the 'definitive answer'.

The next time I search for the same information, the extension would point me directly to my previously marked link.

Maybe by letting people subscribe to each other's answers I could bootstrap a better Google. Developers seem like a good initial target market. Students too...
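The core of that extension idea is a small lookup table from a normalized query to a marked link, plus a way to merge in someone else's table. Here is a minimal in-memory sketch, assuming a made-up `AnswerBook` class; a real extension would need persistent storage and some notion of trust between users, neither of which is shown.

```python
from typing import Dict, Optional


class AnswerBook:
    """Hypothetical per-user store of 'definitive answer' links."""

    def __init__(self) -> None:
        self._pins: Dict[str, str] = {}

    @staticmethod
    def _normalize(query: str) -> str:
        # Lowercase and collapse whitespace so repeated queries match.
        return " ".join(query.lower().split())

    def pin(self, query: str, url: str) -> None:
        """Mark url as the definitive answer for this query."""
        self._pins[self._normalize(query)] = url

    def lookup(self, query: str) -> Optional[str]:
        """Return the previously marked link, or None if nothing is pinned."""
        return self._pins.get(self._normalize(query))

    def subscribe(self, other: "AnswerBook") -> None:
        """Copy another user's pins, keeping our own where both exist."""
        for query, url in other._pins.items():
            self._pins.setdefault(query, url)
```

Keeping your own pin on a conflict is just one possible policy; voting across subscribers would be another.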

freediver over 5 years ago

I like this topic a lot.

Obviously, the easiest approach is to build on top of an existing index. This in turn makes it a purely marketing play (DDG's marketing play is "privacy").

Here is one exploration of the following concept: a search engine built on top of high-quality sites only (as vetted by HN submission history): https://cse.google.com/cse?cx=014479775183020491825:c2lrlzrogb5

Described in full here: https://news.ycombinator.com/item?id=21209358

probinso over 5 years ago

I would build a discovery system. Search systems imply that you know what you're looking for; they also imply a short query.

Imagine instead that you start writing out your ideas in natural form. Documents appear that are relevant to your ideas, but with the goal of diversifying across categories. Instead of the top 100 results, you may get just a few results per category.

As you continue to write, the relevant documents become more constrained, but the system continues to try to maximize diversity.

I don't know if this would work for everything, but it would be interesting.
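One simple way to get "a few results per category" is to group scored documents by category and keep only the best few from each. The sketch below assumes every document already comes with a category label and a relevance score, which is the hard part and is not shown; `diversify` and the tuple layout are made up for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# (doc_id, category, relevance score); layout assumed for this sketch.
Result = Tuple[str, str, float]


def diversify(results: Iterable[Result], per_category: int = 1) -> List[Result]:
    """Keep the top per_category documents of each category, best score first."""
    by_category = defaultdict(list)
    for doc_id, category, score in results:
        by_category[category].append((score, doc_id))

    picked: List[Result] = []
    for category, docs in by_category.items():
        docs.sort(reverse=True)  # highest score first within the category
        for score, doc_id in docs[:per_category]:
            picked.append((doc_id, category, score))

    # Order the final list by relevance across categories.
    picked.sort(key=lambda r: r[2], reverse=True)
    return picked
```

As the user writes more text, the scores would be recomputed and the same selection run again, so the list narrows while still covering several categories.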

ian0 over 5 years ago

In my mind there are two flavours of queries. The first are those which are "ctrl f" in nature, i.e. I want to query the entire web to find string x. These are obviously better suited to indexes. The second are more knowledge-based in nature, i.e. I want to find quality information about a certain topic. These benefit from curation, as anyone who has added "stackoverflow" or "reddit" or "wiki" to a Google query will know.

So I would start with curated and crowdsourced first-page results for the top x% of these knowledge-based queries, with Wikipedia-like guidelines and moderation to ensure the quality of the sites mentioned is up to scratch, coupled with a built-in feedback mechanism from people browsing. I think Wikipedia has proven that, while difficult, a scheme like this is indeed possible.

I also think you can start very niche and play with the results structure. For example, I like motorcycles, and after years of browsing I have discovered the best places for reviews and information. Even just this use case could benefit from a better-structured results page and the removal of all the spammy sites. The same goes for other niches like cooking and programming languages.

rootshelled over 5 years ago

You can't build something that can compete with Google without some very, very deep pockets and lots of data.

Same problem with niche search engines: unless they have some unique features/properties that Google doesn't have, you are plainly better off with Google.

Google has a monopoly on search which, looking at the market, will hold up for the foreseeable future.

So unless you can offer a specific feature (set) for a niche, or you just want to build it for the hell of it, I wouldn't recommend that anyone go into search engines.

I would probably go with what DuckDuckGo does, but offer unique features that are useful for one or more niches.

kamutuna over 5 years ago

With something like ProxyCrawl you should be able to build the crawler yourself and start crawling to build a big index. Once you get big enough and start bringing traffic to the sites, you can stop using the proxy service, create your own user agent, and ask sites to whitelist you. It will take some time, but it can be done.

hakejpc over 5 years ago

Can we leverage torrents somehow to create distributed indices?
