[lightly modified version of a comment I put on the article, as I love HN for discussion!]

Great article -- we're excited there's so much interest in the web as a dataset! I'm part of the team at Common Crawl and thought I'd clarify some points in the article.

The most important point is that you can download all the data Common Crawl provides completely for free, without paying S3 transfer fees or processing it only on an EC2 cluster. You don't even need an Amazon account! Our crawl archive blog posts give full details for downloading[1]. The main challenge then is storing it, as the full dataset is really quite large, but a number of universities have pulled down a significant portion onto their local clusters.

Also, we're performing the crawl once a month now. The monthly crawl archives are between 35 and 70 terabytes compressed. In total, we've crawled and stored over a quarter petabyte compressed, or 1.3 petabytes uncompressed, so far in 2014. (The archives go back to 2008.)

Comparing directly against the Internet Archive datasets is a bit like comparing apples to oranges. They store images and other types of binary content as well, whilst Common Crawl aims primarily for HTML, which compresses better. Also, the numbers quoted for the Internet Archive cover all of the crawls they've ever done, while ours were for a single month's crawl.

We're excited to see Martin use one of our crawl archives in his work -- seeing these experiments come to life is the best part of working at Common Crawl! I can confirm that optimizations will help you lower that EC2 figure. We can process a fairly intensive MR job over a standard crawl archive in an afternoon for about $30. Big data on a small budget is a top priority for us!

[1]: http://blog.commoncrawl.org/2014/11/october-2014-crawl-archive-available/
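
To make the free-download point concrete, here is a minimal Python sketch of pulling one archive over plain HTTPS. The host and paths-file location (https://data.commoncrawl.org/ and crawl-data/CC-MAIN-2014-42/warc.paths.gz) are my assumptions for illustration; the crawl announcement post linked above has the authoritative URLs.

    # Minimal sketch: list the WARC files for one crawl and download the first
    # one over plain HTTPS -- no AWS account needed.
    # The host and paths-file location below are assumptions; check the crawl
    # announcement post for the authoritative URLs.
    import gzip
    import urllib.request

    BASE = "https://data.commoncrawl.org/"                    # assumed public HTTPS mirror
    PATHS = BASE + "crawl-data/CC-MAIN-2014-42/warc.paths.gz"  # October 2014 crawl (assumed path)

    # The paths file is a gzipped text file listing every WARC file in the crawl.
    with urllib.request.urlopen(PATHS) as resp:
        warc_paths = gzip.decompress(resp.read()).decode().splitlines()

    print(len(warc_paths), "WARC files in this crawl")

    # Fetch the first archive (roughly 1 GB compressed) straight to disk.
    urllib.request.urlretrieve(BASE + warc_paths[0], "example.warc.gz")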
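
And a rough sketch of the per-record work inside a job like that $30 MR run, iterating over a single downloaded WARC file locally. It assumes the third-party warcio library, which is my choice for illustration rather than anything Common Crawl requires; the same loop body is what you would drop into a Hadoop/EMR mapper.

    # Sketch of the map step: count response records per domain in one WARC file.
    # Assumes `pip install warcio` -- the library is an assumption for illustration.
    from collections import Counter
    from urllib.parse import urlparse

    from warcio.archiveiterator import ArchiveIterator

    counts = Counter()
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):    # handles the gzip transparently
            if record.rec_type != "response":
                continue                          # skip request/metadata records
            uri = record.rec_headers.get_header("WARC-Target-URI")
            if uri:
                counts[urlparse(uri).netloc] += 1

    for domain, n in counts.most_common(10):
        print(n, domain)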