
Google Exec Says It's A Good Idea: Open The Index And Speed Up The Internet

235 points by helwr almost 14 years ago

38 comments

geekfactor almost 14 years ago
It strikes me that the entire article/proposal is based on a faulty premise:

*"After all, the value is not in the index it is in the analysis of that index."*

The ability for a given search engine to innovate is based on having control of the index. The line between indexing and analysis isn't as clean as the article implies, if only for the simple fact that you can only analyze what is in the index.

For example, at its simplest, an index is a list of which words appear in which documents on the web. But what if I want to give greater weight to words that appear in document titles or headings? Then I need to somehow put that into the index.

What if I want to use the proximity between words to determine the relevance of a result for a particular phrase? I need to get that info into the index, too.

In the end, what the author really wants is for someone to maintain a separate copy of the internet for bots. Whoever did that would need to charge the bot owners, but the bot owners can already index your content for free, so why would they pay?
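For illustration, here's a minimal sketch (in Python, with made-up names) of how field weights and word positions end up baked into the index structure itself:

```python
from collections import defaultdict

# Hypothetical posting-list entry: once you want title weighting and
# phrase proximity, the index must store fields and positions, not
# just a document list -- the "analysis" leaks into the index format.
index = defaultdict(list)  # term -> [(doc_id, field, position), ...]

def add_document(doc_id, title, body):
    for pos, word in enumerate(title.lower().split()):
        index[word].append((doc_id, "title", pos))
    for pos, word in enumerate(body.lower().split()):
        index[word].append((doc_id, "body", pos))

def score(term, doc_id, title_boost=2.0):
    # Field-aware scoring only works because the field was indexed.
    return sum(title_boost if field == "title" else 1.0
               for d, field, _ in index[term] if d == doc_id)

add_document(1, "Open the index", "Crawling the web is expensive")
print(score("index", 1))  # a title hit scores higher than a body hit
```

Two engines that disagree about what belongs in those tuples can't share one index, which is the point.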
jedberg almost 14 years ago
No one seems to remember that Amazon did this 5 years ago: http://arstechnica.com/old/content/2005/12/5756.ars
panic almost 14 years ago
What "index" is this article asking Google to open? The index against which they run actual queries has to be tied to Google's core search algorithms, which I doubt they'd want to make public.

So would they open an "index" of web page contents? In this case, why would another search engine access Google's "index" rather than the original server? The original server is guaranteed to be up to date, and there's no single point of failure.
trotsky almost 14 years ago
I guess I don't understand: if someone provides me with a storage cluster of the_whole_internet for free, won't my proprietary_search_algorithm significantly degrade the IOPS and network bandwidth of the storage? Where would it all live? In some Google data center that now anyone can demand colocation in? What happens when I accidentally or maliciously slow down Bing's updates and degrade their quality? And, as others mentioned, what happens when people push data into the index that doesn't represent what they're actually hosting?

It seems like this would be quite a complex project for a for-the-public-good approach. Maybe it could work as an AWS project to sell Amazon compute cycles.
sebastianavina almost 14 years ago
yeah, sure... let's make a system and store all kinds of information there, so people can browse it... it would be great to distribute it around the world, maybe across different companies, and sync the data every day so it stays fresh... I don't know, maybe we could even have every person store their own data on their own private server... but of course, in an open index... </sarcasm>
chrislomax almost 14 years ago
I think this is a good idea, but the notion of people syncing their own data doesn't work; it gives too much room for people to fudge their data into the system so it favours them more.

There would also be a fight over who gets to be the aggregator of the information, and whoever distributes it would have a stranglehold on the industry in terms of how and when it supplies that information.

I can see its uses, but I can equally see a lot of ways for the system not to work, or some serious amount of antitrust.

If you could get an unbiased third party involved to build the database, though, I think that would work.
Emore almost 14 years ago
For the record, the Google exec (Berthier Ribeiro-Neto) is the co-author of "Modern Information Retrieval" [1], an excellent book and close to a standard text on IR.

[1] http://www.amazon.com/Modern-Information-Retrieval-Ricardo-Baeza-Yates/dp/020139829X
extension almost 14 years ago
We're talking about the *cache*, right? The index, or more likely indices, are optimized data structures used to search the cache. I doubt Google could share those without revealing too much about their ranking algorithm.

Letting sites inject into the cache is an interesting idea, but Google will still have to spider periodically to ensure accuracy. Inevitably, a *large* number of sites will just screw it up, because the internet is mostly made of fail. This would leave Google with only bad options: if they delist all those sites to punish them, they leave a significant hole in their dataset. But if they don't punish them and just silently fix it by spidering, there is no longer any threat to keep the black hat SEOs in check. Either way, it would cause an explosion in support requirements, and Google is apparently already terrible at that.
ChuckMcM almost 14 years ago
"Each of these robots takes up a considerable amount of my resources. For June, the Googlebot ate up 4.9 gigabytes of bandwidth, Yahoo used 4.8 gigabytes, while an unknown robot used 11.27 gigabytes of bandwidth. Together, they used up 45% of my bandwidth just to create an index of my site."

I don't suppose anyone has considered making an entry in robots.txt that says either:

last change was: <parsable date>

or a URL list of the form

<relative_url> : <last change date>

There are a relatively small number of robots (a few tens, perhaps) that crawl your web site, and all of the legit ones provide contact information either in the user-agent header or on their web site. If you let them know you had adopted this approach, they could very efficiently not crawl your site.

That solves two problems:

@ web sites on the back end of ADSL lines that don't change often wouldn't have their bandwidth chewed up by robots,

@ the search index would stay up to date, so someone who needed to find you through that search engine would still find you.
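As a sketch, a robots.txt extended along those lines might look like the following (the directive names here are invented, not part of any spec; sitemap <lastmod> entries later came to serve much the same role, though crawlers treat them only as hints):

```
# Hypothetical last-modified hints for crawlers.
User-agent: *
Last-Change: 2011-05-28T14:00:00Z

# Per-URL form: <relative_url> : <last change date>
Change-List: /index.html : 2011-05-28
Change-List: /forums/ : 2011-05-27
Change-List: /static/about.html : 2010-11-02
```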
SoftwareMaven almost 14 years ago
A couple of thoughts come to mind:

1. If I were Microsoft, I wouldn't trust Google's index. How do I know they aren't doing subtle things to the index to give them an advantage?

2. Having the resources to keep a live snapshot of the web is one of the big players' advantages. Opening the index, while good for the web, would not necessarily be good for the company. Google could mitigate that by licensing the data: for data more than X hours old, you get free access; for data newer than that, you pay a license fee to Google. Furthermore, integrate the data with Google's cloud hosting to provide a way to trivially create map/reduce implementations that use the data.

3. On the other side, what a great opportunity the index could provide for startups. Maintaining a live index of the web is costly and getting more and more difficult as people lock down their robots.txt. Being able to immediately test your algorithms against the whole web would be a godsend for ensuring your algorithms work with the huge dataset and that your performance is sufficient.

Here's to hoping Google goes forward with it!
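To illustrate point 2, here's a toy map/reduce word count over a hosted crawl (purely a sketch; the record format and the idea of the crawl as (url, text) pairs are invented for the example):

```python
from collections import Counter

# Invented record format: the open crawl as (url, page_text) pairs.
crawl = [
    ("http://example.com/", "open the index and speed up the internet"),
    ("http://example.org/a", "the index is not the analysis"),
]

def mapper(url, text):
    # Map step: emit (word, 1) for every word on the page.
    for word in text.split():
        yield word, 1

def reduce_counts(pairs):
    # Reduce step: sum the counts per word.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

word_counts = reduce_counts(
    kv for url, text in crawl for kv in mapper(url, text))
print(word_counts.most_common(3))
```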
thevivekpandey almost 14 years ago
The first step would be for some top companies (Google, Yahoo...) to share the index. That way there would be some speed-up of the internet, and the index would not be open to abuse by arbitrary people/companies.
mmaunder almost 14 years ago
The author should use something like "crawl data" instead of "index". An index is the end result of analyzing crawled web pages.

It's a cool idea, though, because Yahoo sucks up a ton of my bandwidth and delivers very little SEO traffic. On most of my sites I now have a Yahoo-bot-specific Crawl-Delay of 60 seconds in robots.txt, which pretty much bans them.
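For reference, such an entry looks like this (Crawl-delay is a non-standard directive, but Yahoo's Slurp crawler honoured it):

```
# Throttle Yahoo's crawler to one request per minute.
User-agent: Slurp
Crawl-delay: 60
```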
stretchwithme almost 14 years ago
Maybe each site should be able to designate who indexes it, and robots can get that index from that indexer. Let the indexers compete. Let each site decide how frequently it can be indexed. Allow the indexer that gets the business to use the index immediately, with others getting access just once a day. Perhaps a standardized raw sharable index format could be created, with each search company processing it further for their own needs after pulling it.

And let the site notify the indexer when things change, so all the bandwidth isn't spent looking for what's changed. Actual changes could make it into the index more quickly if the site could draw attention to them immediately, rather than an army of robots having to invade as frequently as inhumanly possible. The selected indexer could still visit once a day or week to make sure nothing gets missed.
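A minimal sketch of the notification half of that idea (the endpoint and payload here are made up; sitemap pings and, later, WebSub are the real-world analogues):

```python
import json
import urllib.request

# Hypothetical "tell my designated indexer this URL changed" ping.
INDEXER_ENDPOINT = "https://indexer.example/api/changed"  # invented

def notify_indexer(url, changed_at):
    payload = json.dumps({"url": url, "changed": changed_at}).encode()
    req = urllib.request.Request(
        INDEXER_ENDPOINT, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.status  # e.g. 204: indexer accepted the hint

# notify_indexer("http://example.com/forums/", "2011-05-28T14:00:00Z")
```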
ck2 almost 14 years ago
Google would never do this.

Their attitude is to take everything in but not to let you automate searches to get data out.

This is the biggest problem I have with search engines: you want to deep-index all my sites? Fine, but you'd better let me search in return, deeper than 1000 results (and ten pages). Give us RSS, etc.
sigil almost 14 years ago
"Index" is the wrong word. He's not calling for Google to open up their index, but rather their web cache.
random42 almost 14 years ago
This article is about a year old. [July 2010]
jwr almost 14 years ago
It strikes me that both in the article and in most comments, people have no idea what they are talking about, and yet they boldly carry on.

"The index"? Feature extraction is the most complex part of almost any machine learning algorithm, and search is no different. Indexing full-text documents is a really difficult task, especially if you take inflected languages into account (English is particularly easy).

I don't see a way to "open the index" without disclosing and publishing a huge amount of highly complex code that also makes use of either large dictionaries or huge amounts of statistical information. It's not like you can just write a quick spec of "the index" and put it up on GitHub.

FWIW, I run a startup that wrote a search engine for e-commerce (search as a service).
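A toy illustration of why inflection matters (naive suffix stripping, invented for this example; real engines use language-specific analyzers such as Snowball stemmers):

```python
# A naive stemmer: passable for English, useless for a language like
# Polish or Finnish where one lemma has dozens of surface forms.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print(crude_stem("indexes"))   # -> "index"
print(crude_stem("crawling"))  # -> "crawl"
print(crude_stem("ran"))       # -> "ran": irregular forms need dictionaries
```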
198d almost 14 years ago
I don't think it's quite that simple. The index that Google serves search query results from is a direct result of the algorithms they've applied to the data the Googlebot has gathered. If by 'index' the author means the data the Googlebot (for example) has downloaded from the internet, that's quite a bit different, but still probably serves the purpose the author is looking for. The index is a highly specialized representation of all the data they've collected.
mindstab almost 14 years ago
Does it seem naive to anyone else to allow site owners to update the index and stop spidering? a) Lots of people, for various reasons (ignorance, security through obscurity), would just not update it, and stuff would fall out of search. b) This seems incredibly ripe for abuse. As if we don't have enough search-spam problems already, letting spammers have more direct access to the content going into their rankings seems like a truly bad idea.
tlb almost 14 years ago
When spiders use more bandwidth than customers, your website must not be very popular. It implies that each page is viewed only a handful of times per month on average.
eykanal almost 14 years ago
Good article. The fact is, the index itself isn't worth nearly as much as the algorithms. Heck, open the index and let anyone add to it. MSN, Yahoo, Bing, anyone... let them all add to that single index and make it awesome, and then anyone can try their hand at making a great search algorithm. If each company really thinks its search algorithm is better than everyone else's, this is competition at its best.
SkimThat almost 14 years ago
TL;DR - A lot of traffic on the Internet comes from search engine bots like Google's and Yahoo's indexing pages. If Google's index were open, search engines could share each other's resources and not have to repeatedly spider pages. This would significantly boost traffic speed, and the idea was even supported by Larry Page, one of Google's co-founders. Page initially resisted Google going commercial.
braindead_in almost 14 years ago
The title is a bit misleading. The author suggested it, and the Google Brazil head supported it and said, 'You should write a position paper on it.'
tlrobinson almost 14 years ago
What format would the indexes be made available in? Raw lists of URLs and caches of the HTML pages, or pre-built inverted indexes, PageRank data, etc.?

If it's the former, all this really does is move the burden from sites to Google, and it introduces a single point of failure.

If it's the latter, which seems unlikely, what incentive does Google have to share that data? It's part of their competitive advantage.
bkudria almost 14 years ago
Google has a ton of private data in their index that should never have been indexed. It's just that no one has thought to search for it yet. (See: http://en.wikipedia.org/wiki/Johnny_Long)

A single public index would expose this data to stronger analysis (or even plain reading), not just Google search queries.
redditmigrant almost 14 years ago
I don't know if this is naive, but wouldn't the data model/storage strategy of the index be influenced by the ranking algorithms that use it? If that's the case, I would presume Google's index stores the data in a form that's efficient for their ranking algorithm to work off of, and it might not be the best format for, say, Bing or Yahoo to use.
dennisgorelik almost 14 years ago
Centralization [of the search index] has significant overhead.

Bandwidth is not nearly as expensive as the overhead of such search-index centralization.
ecaradec almost 14 years ago
Even if it would be beneficial to the whole internet, if Google did this it would be like handing an advantage to all of Google's competitors: they wouldn't need to solve the crawling problem. It may not be algorithmically gorgeous, but it's still one problem fewer. It would be fun, though; we could buy a tarball of the whole internet ;)
robot almost 14 years ago
Also, why not use a single base station at each location for all mobile service providers, rather than having multiple 3G base stations for each provider, polluting our radio space? When there is competition, there are always multiples of something; it's just a fact of the open market, and we may have to live with it.
bluelu almost 14 years ago
In the end, one single company would control the internet. I hope that's not something you want. Just as Twitter controls Twitter and only opens up its data to Gnip, etc...

This won't be accepted. And even legally it isn't possible, due to copyright laws in different countries.
endlessvoid94 almost 14 years ago
I have a potentially stupid question. When the author says "45% of my bandwidth", does he mean 45% of a QUOTA? Or actually 45% of the pipe is being used?<p>If it's the former, this seems like it wouldn't help speed at all.
Apple-Guy almost 14 years ago
The Google guy does not work in the head office and isn't in charge of policy. He doesn't understand that search -> ads is what earns Google its riches.
stcredzero almost 14 years ago
If Google and a few other companies could charge some multiple of what it costs to index a site, it could even be a money-making prospect for them.
benwerd almost 14 years ago
Well, on one level, it's a great idea. On another, it gives Google the keys to the entire freaking web.
brianobush almost 14 years ago
Part of the secret that makes any search engine unique is the knowledge that a site at x.com exists and that there is a forum at x.com/forums which is not discoverable by simply crawling from the root of x.com. On the other hand, I would love an open web cache for my work.
joshaidan almost 14 years ago
While I think this is a really cool idea, for some reason the word hiybbprqag comes to mind. :)
ddemchuk almost 14 years ago
The reason Google (and Bing and Yahoo and Yandex and and and) is in the position they are in is because they have the bandwidth and computational power to crawl and index the web with the speed and reach necessary for it to be useful. They aren't going to just start giving that away any time soon...
agentultra almost 14 years ago
There are protocols for bots. Not all bots follow them... so block requests from the ones that don't.

Problem solved... like a million internet years ago.
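For example, a server-side refusal for a crawler that ignores robots.txt (an nginx sketch; "badbot" is a placeholder for whatever user-agent shows up in your logs):

```nginx
# Inside a server block: refuse requests from a crawler that
# ignores robots.txt. "badbot" is a stand-in user-agent string.
if ($http_user_agent ~* "badbot") {
    return 403;
}
```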