
Writing your own search engine is hard (2004)

161 points · by georgehill · almost 3 years ago

14 comments

Xeoncross · almost 3 years ago

Yeah, there are certainly more problems these days. For one, the web is larger, and more of it is spam, which trips up pure PageRank with networks of sites that link heavily to each other.

Important sites have a bunch of anti-crawling detection set up (especially news sites). Worse, the best user-generated content is behind walled gardens in Facebook groups, Slack channels, Quora threads, etc...

The rest of the good sites are JavaScript-heavy and you often have to run headless Chrome to render the page and find the content - but that is detectable, so you end up renting IPs from mobile number farms or trying to build your own 4G network.

On the upside, https://commoncrawl.org/ now exists and makes the prototype crawling work much easier. It's not the full internet, but it gives you plenty to work with and test against, so you can skip to the part where you figure out whether you can produce anything useful should you actually try to crawl the whole internet.
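As a rough illustration of prototyping against Common Crawl: its index can be queried over HTTP, one JSON record per line. This is a sketch, not a verified client; the crawl ID below is a hypothetical example (current IDs are listed at index.commoncrawl.org), and no request is actually sent here.

```python
import json
from urllib.parse import urlencode

# Hypothetical crawl ID -- check https://index.commoncrawl.org/ for current ones.
CRAWL_ID = "CC-MAIN-2022-27"

def cc_index_query_url(url_pattern: str) -> str:
    """Build a Common Crawl index API query URL for a URL pattern."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{CRAWL_ID}-index?{params}"

def parse_index_records(response_text: str) -> list[dict]:
    """The index API returns one JSON object per line; parse them into dicts."""
    return [json.loads(line) for line in response_text.splitlines() if line.strip()]
```

Each record points at a byte range (`filename`, `offset`, `length`) inside a WARC file on S3, so you can fetch just the pages you care about instead of whole crawl archives.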
boyter · almost 3 years ago

Glad to see this on the front page. One of those posts I reread every now and then. Better yet, it's written by Anna Patterson, who in addition to the searches mentioned at the bottom wrote chunks of Cuil (interesting even if it failed) and worked on parts of Google's index both before Cuil and, I think, now.

Sadly it's a little out of date. I'd love to see a more modern post by someone. Perhaps the authors of Mojeek, Right Dao, or someone else running their own custom index. Heck, I'd pay for something by Matt Wells of Gigablast or those behind Blekko. The whole space is so secretive that for those really interested in it, only crumbs of information are ever released.

If you are into this space or just curious, the videos about BitFunnel, which forms part of the Bing index, are an excellent watch: https://www.youtube.com/watch?v=1-Xoy5w5ydM and https://www.clsp.jhu.edu/events/mike-hopcroft-microsoft/#.YT_6UC0Rpf0
streets1627 · almost 3 years ago

Hey folks, I am one of the co-founders of neeva.com

While writing a search engine is hard, it is also incredibly rewarding. Over the past two years, we have brought up a meaningful crawl / index / serve pipeline for Neeva. Being able to create pages like https://neeva.com/search?q=tomato%20soup or https://neeva.com/search?q=golang+struct+split which are so much better than what is out there in commercial search engines is so worth it.

We are private, ad-free, and customer-paid.
kragen · almost 3 years ago

Useful context for this is that Anna Patterson started the search engine company Cuil in 02008, four years after writing this article (when she was still at Google). Its results were bad enough that the "Cuil Theory" meme was launched on Reddit and Tumblr making fun of it: https://knowyourmeme.com/memes/sites/cuil-theory

> One Cuil = One level of abstraction away from the reality of a situation.

> Example: You ask me for a Hamburger.

> 1 Cuil: if you asked me for a hamburger, and I gave you a raccoon.

> 2 Cuils: If you asked me for a hamburger, but it turns out I don't really exist. Where I was originally standing, a picture of a hamburger rests on the ground.

> 3 Cuils: You awake as a hamburger. You start screaming only to have special sauce fly from your lips. The world is in sepia.

> 4 Cuils: Why are we speaking German? A mime cries softly as he cradles a young cow. Your grandfather stares at you as the cow falls apart into patties. You awake only to see me with pickles for eyes, I am singing the song that gives birth to the universe.

http://cuiltheory.wikidot.com/

Two years later, in 02010, the founders shut down the search engine, laid off all the employees, sold its patents to Google, and became Google employees again. Now all that remains of Cuil is the Cuil Theory Wiki.

I'd be really interested to see a retrospective on what went wrong. I guess writing your own search engine really *is* hard, but I'd like to know what turned out to be so much harder than they expected.
jillesvangurp · almost 3 years ago

The hardest part of building your own search engine is not that it is technically hard (it's actually pretty easy) but that the bar for success is just really high. There are a few existing engines and they offer their services for free. So, not only is the bar really high, you have no viable revenue model, and this stuff gets expensive quickly.

Or put differently, if you are going to replicate what existing search engines already do, you are probably not going to be as good initially, and you are going to struggle to make money. Fixing the money part is the hard part.
t_mann · almost 3 years ago

Would be interesting to see stats from that time on how many people were working on search engines and how it turned out for them. Did they end up getting acquired, at least funded for a while, exit, or just bootstrap themselves until they realized there'd only be one winner?
jefftk · almost 3 years ago

This is pretty reasonable for 2004, but the problems have changed. Everything on that page is totally doable for a serious engineering team, and has been done many times. The real hard part is what they briefly touch on with:

> Don't do page rank initially. Actually don't do it at all. For this observation I risk being inundated with hate mail, but nonetheless don't do page rank. If you four guys in your garage can't get something decent-looking up without page rank, you're not going to get anything decent up with page rank. Use the source, Luke - the HTML source, that is. Page rank is lengthy analysis of a global nature and will cause you to buy more machines and get bogged down on this one complicated step - this one factor in ranking. Start by exploiting everything else you can think of: Is the word in the title? Is it in bold? etc. Spend your time thinking about anything you can exploit and try it out.

The web is full of sites that want to rank, since traffic makes money and appearing high in search results gets you traffic. Simple handling of HTML source is incredibly gameable, and while it might have worked okay on the web of 2004, it definitely is not enough now. It's you and your team versus an enormous number of SEO people.
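The "use the HTML source" advice from the quote can be sketched as a toy scorer over a couple of on-page signals. The weights here (3 for a title hit, 1 for bold) are made-up illustrations, not tuned values, which is exactly why this approach is so gameable:

```python
from html.parser import HTMLParser

class SignalExtractor(HTMLParser):
    """Collect text found inside <title> and <b>/<strong> tags."""
    def __init__(self):
        super().__init__()
        self._stack = []      # currently open tags
        self.title_text = []
        self.bold_text = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if "title" in self._stack:
            self.title_text.append(data)
        if "b" in self._stack or "strong" in self._stack:
            self.bold_text.append(data)

def score(html: str, term: str) -> int:
    """Toy 2004-style ranking signal: is the term in the title or in bold?"""
    p = SignalExtractor()
    p.feed(html)
    term = term.lower()
    s = 0
    if term in " ".join(p.title_text).lower():
        s += 3  # illustrative weight, not a real ranking constant
    if term in " ".join(p.bold_text).lower():
        s += 1
    return s
```

Anything this simple is trivially spammable by stuffing the title and bolding keywords, which is the comment's point.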
wolfgang42 · almost 3 years ago

I've been puttering away at making a search engine of my own (I should really do a Show HN sometime); let's see how my experience compares with 18 years ago:

Bandwidth: This is now also cheap; my residential service is 1 Gbit. However, the suggestion to wait until you've got indexing working well before optimizing crawling is IMO still spot-on; trying to make a polite, performant crawler that can deal with all the bizarre edge cases (https://memex.marginalia.nu/log/32-bot-apologetics.gmi) on the Web will drag you down. (I bypassed this problem by starting with the Stack Exchange data dumps and Wikipedia crawls, which are a lot more consistent than random websites.)

CPU: Computers are *really* fast now; I'm using a 2-core computer from 2014 and it does what I need just fine.

Disk: SATA is the new thing now, of course, but the difference these days is HDD vs SSD. SSD is faster, but you can design your architecture so that this mostly doesn't matter, and even a "slow" HDD will be running at capacity. (The trick is to do linear streaming as much as possible, and avoid seeks at all costs.) Still, it's probably a good idea to store your production index on an SSD, and it's useful for intermediate data as well; by happenstance more than design I have a large HDD and a small SSD, and they balance each other nicely.

Storing files: 100% agree with this section, for the disk-seek reasons I mention above. Also, pages from the same website often compress very well against each other (since they're using the same templates, large chunks of HTML can be squished down considerably), so if you're pressed for space, consider storing one gzipped file per domain. (The tradeoff with zipping is that you can't arbitrarily seek, but ideally you've designed things so you don't need to do that anyway.) Also, WARC is a standard file format that has a lot of tooling for this exact use case.

Networking: I skipped this by just storing everything on one computer; I expect to be able to continue doing this for a long time, since vertical scaling can get you *very* far these days.

Indexing: You basically don't need to write *anything* to get started with this these days! I'm just using bog-standard Elasticsearch with some glue code to do html2text; it's working fine and took all of an afternoon to set up from scratch. (That said, I'm not sure I'll *continue* using Elasticsearch: it has a ton of features I don't need, which makes it very hard to understand and work with since there's so much that's irrelevant to me. I'm probably going to switch to either straight Lucene or Bleve soon.)

Page rank: I added PageRank very early on in the hopes that it would improve my results, and I'm not really sure how helpful it is if your results aren't decent to begin with. However, the march of Moore's law has made it an easy experiment: what Page and Brin's server could compute in a week with carefully optimized C code, mine can do in less than 5 minutes (!) with a bit of JavaScript.

Serving: Again, Elasticsearch will solve this entire problem for you (at least to start with); all your frontend has to do is take the JSON result and poke it into an HTML template.

It's easier than ever to start building a search engine in your own home; the recent explosion of such services (as seen on HN) is an indicator of the feasibility, and the rising complaints about Google show that the demand is there. Come and join us, the water's fine!
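The PageRank-in-minutes point is easy to reproduce: on a small link graph, plain power iteration is a few lines. A minimal sketch, assuming the graph fits in memory as an adjacency dict (the standard 0.85 damping factor, dangling nodes spread uniformly):

```python
def pagerank(links: dict[str, list[str]],
             damping: float = 0.85, iters: int = 50) -> dict[str, float]:
    """Plain power-iteration PageRank over an in-memory adjacency dict."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Everyone starts each round with the teleport share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    if q in new:  # ignore links pointing outside the graph
                        new[q] += share
            else:
                # Dangling node: distribute its rank uniformly.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

For web-scale graphs you would stream edges from disk and iterate over sorted runs rather than hold a dict, but the arithmetic is the same.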
wizofaus · almost 3 years ago

Doesn't mention the hardest part I found when developing a crawler - dealing with pages whose content is mostly dynamic and generated client-side (SPAs). Even using V8, it's hard to do reliably and performantly at scale.
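One cheap pre-filter before paying the cost of headless rendering is to guess whether a fetched page is an SPA shell with no server-rendered content. This is a heuristic sketch, not the commenter's method, and the 200-character threshold is an arbitrary assumption:

```python
import re

def looks_client_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: scripts present but almost no visible body text suggests
    an SPA shell that would need a headless browser to render."""
    # Drop script/style blocks entirely, then strip remaining tags.
    stripped = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", stripped)
    visible = " ".join(text.split())
    has_scripts = "<script" in html.lower()
    return has_scripts and len(visible) < min_text_chars
```

Pages flagged this way can be routed to the expensive rendering pipeline while plain HTML goes straight to the indexer.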
Rockjodd · almost 3 years ago

For those interested in search engines, I recommend checking out Vespa.ai [1][2] - the engine behind several features at Yahoo.

[1] https://vespa.ai/
[2] https://docs.vespa.ai/en/getting-started.html
forgotmypw17 · almost 3 years ago

This is from the "doesn't scale" quadrant, but if you are not confident that your bot will behave well, shouldn't you supervise everything it does closely until you become confident?
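A minimal piece of that good behavior is honoring robots.txt before every fetch, which Python's standard library handles directly. A sketch (the robots.txt content and crawler name here are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a polite crawler fetches and parses the real one
# per host before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def may_fetch(url: str, agent: str = "MyCrawler") -> bool:
    """Check a URL against the parsed rules before crawling it."""
    return rp.can_fetch(agent, url)
```

Beyond robots.txt, "supervising closely" usually also means per-host rate limits and logging every request so misbehavior shows up immediately.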
amelius · almost 3 years ago

The article doesn't touch upon the hardest and most interesting part: NLP and finding the most relevant results. I would like to see a post on this.
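For context on what "most relevant results" means at the lexical baseline level: the classic starting point is BM25 (what Lucene and Elasticsearch use by default). A minimal sketch, assuming pre-tokenized documents and the conventional k1/b defaults:

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against the query with classic BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    # Document frequency per term.
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if df[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term frequency saturates via k1; b normalizes for doc length.
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Modern relevance work layers learned rankers and embeddings on top of a lexical stage like this, which is presumably the post the commenter wants.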
cratermoon · almost 3 years ago

This brings back memories. Pentiums. IDE vs SCSI. Yahoo & dmoz. 10 kilobytes per web page (even HN is 34K). Colo facilities.
ldjkfkdsjnv · almost 3 years ago

Theory I have:

Text search on the web will slowly die. People will search video-based content, and use the fact that a human spoke the information, as well as comments/upvotes, to vet it as trustworthy material. Google search as we know it will slowly die, and then will decline like Facebook. TikTok will steal search market share as their video clips span all of human life.