Data is at the heart of search, but who has access to it?

105 points by dpw about 10 years ago

20 comments

ChuckMcM about 10 years ago

Sigh, this is incorrect.

Edit: incorrect is perhaps too strong; it is incomplete.

While it is true that click tracking can be used as a relevance signal, the people who were really pissed off when the data stream got dumped were advertisers who wanted to buy AdWords. That was a very simple system: pay someone for clickstream data, extract trending queries, front those with AdWords buys to get your page on the top of Google's results, and profit.

Having built a search engine and run it for 5 years, we got to see what people felt was relevant and what wasn't, in a very loose way, with clickstream data. Basically you have a query and 10 blue links; you can split the results into quartiles and figure out if the thing they clicked on was top half, bottom half, top quarter, second quarter, etc., and do A/B testing to see how that played out. But what we found was that the best indication of what a page was about was the text that linked to it. If you have an in-link to a page which was "<href='page'>great radio site"[1], then "great radio site" would be a query that should return that page, which might be titled something like "bob's electromagnetic spectrum imaginarium" or something equally unlikely to come up in a query string.

So the bottom line is that there are lots of ways to try to determine relevance; clickstream data is a part of that but by no means the biggest factor.

[1] neutered html for obvious reasons.
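
To make the two signals in the comment above concrete, here is a minimal sketch, not the commenter's actual system. It assumes in-links arrive as hypothetical `(target_url, anchor_text)` pairs, scores a page by how often the query terms appear in anchor text pointing at it, and buckets a click on a 10-link results page into quarters. The function names and the example URLs are illustrative assumptions.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Map each lowercase anchor-text term to the pages it was used to link to.

    `links` is an iterable of (target_url, anchor_text) pairs (assumed format).
    """
    index = defaultdict(lambda: defaultdict(int))
    for target_url, anchor_text in links:
        for term in anchor_text.lower().split():
            index[term][target_url] += 1
    return index


def search(index, query):
    """Rank pages by the number of anchor-term occurrences matching the query."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest score first; ties broken by URL so the order is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))


def click_bucket(position, page_size=10):
    """Return which quarter (1-4) of the results page a 1-based click position falls in."""
    if not 1 <= position <= page_size:
        raise ValueError("position must be between 1 and page_size")
    return (position - 1) * 4 // page_size + 1


links = [
    ("http://example.com/bob", "great radio site"),
    ("http://example.com/bob", "radio site"),
    ("http://example.com/other", "great pizza"),
]
index = build_anchor_index(links)
results = search(index, "great radio site")
```

The page's own title never enters the score, which is the point of the comment: the query matches the words other people used to describe the page.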

jfuhrman about 10 years ago

>In Germany, for example, where Google has over 95% market share, competing search engines don’t have access to adequate past search data to deliver search results that are as relevant as Google’s. And, because their search results aren’t as relevant as Google’s, it’s difficult for them to attract new users. You could call it a vicious circle.

This is interesting because of the browser choice enforced by the EU on Windows. IE, whose default is Bing, lost share to other browsers like Chrome, Firefox and Opera, which all had Google as the default. So an attempt to fix the browser market totally distorted the web search market. I wonder why MS didn't ask the EU to require that the alternate browsers in the browser choice screen have Bing as the default search.

I wonder if the EU will mandate that search relevancy data must be shared by Google with rival search engines like DDG, just like they mandated that SMB shares and Office formats must be documented by MS and released to developers.

solve about 10 years ago

Other than the index data, there's something even bigger.

Google's biggest PR success is convincing everyone that the quality of web rankings depends almost purely on algorithms. It does not. What allows Google to hold their monopoly is the $100s of millions (or more) they continuously pay to amass more manually created training data:

http://www.theregister.co.uk/2012/11/27/google_raters_manual

http://www.forbes.com/sites/timworstall/2012/11/27/is-googles-algorithm-really-just-1500-homeworkers/

A new search engine could appear today with algorithms 10x better than Google, but without access to this scale of training data, their rankings wouldn't even be close to Google's quality.

Google maintains their position by paying cash for this monopoly on training data made by tens of thousands of $9/hour workers, not through superior algorithms!

bobajeff about 10 years ago

I think a problem here is that there is no competition in search, just like there is no competition in social networks and operating systems. Not like there is for things like automobiles, electronics and clothing.

Computers introduce a means to lock people in that doesn't exist in other markets. In software products there are often ecosystems that tie directly into the product/service and are not required to be shared with competitors, unlike road systems for cars.

Regulators ought to look into ways to enforce measures that require the companies to completely open their ecosystems to competitors. Or look into ways to standardize these ecosystems and require that every service/application/website comply with them (similar to how media companies are forced to include closed captioning).

sanxiyn about 10 years ago

In South Korea, Google's market share is below 5%, and Naver gets more than 80% of search queries. I think this is the reason why Google's search results for Korean content are not as good as its results for content in other languages.

jjoe about 10 years ago

So the whole push for SSL/https from Google has been opportunistic rather than good practice. I mean why would a search engine go as far as to make SSL a ranking signal?

ocdtrekkie about 10 years ago

It makes you wonder how many changes were made for "privacy" and how many changes were made for "protecting our business".

pcl about 10 years ago

Interesting. I wonder to what extent this reasoning was behind executive support of the Chrome project, and whether it was a factor from the outset or something that Google stumbled upon after developing a browser.

ntakasaki about 10 years ago

>In 2011, Google famously accused Microsoft’s Bing search engine of doing exactly that: logging Google search traffic in Microsoft’s own Internet Explorer browser in order to improve the quality of Bing results.

MS didn't do that from IE; they did it for users who installed the Bing bar, a huge difference.

Metapilot about 10 years ago

I think the author's perspective is skewed in order to stay in line with the title. Here's an example of why I say that:

The author states that "For some 90% of searches, a modern search engine analyzes and learns from past queries, rather than searching the Web itself, to deliver the most relevant results." This may be true for some types of searches, but overall I think the statement is misleading.

Rather, it's better to think of it like this: one important part of the algorithmic process involves constantly crawling the web and updating the index with new information. (Important or frequently updated web sites may get crawled all day every day, while ones that are less important may get crawled only weekly or monthly.) Meanwhile, another part of the algorithmic process constantly analyzes new info discovered in the crawl and combines it with, as the author mentioned, click-through data learned from past queries.

The answers to many queries don't change, while the answers to many other queries deserve freshness. For example, I'm quite certain Einstein's date of birth hasn't changed in quite a while, but his theory of relativity is in constant discussion, and there is always new information and there are new queries pertaining to it. As a result, there is not much need for a search engine to go digging for the latest info on an "einstein's birthday" query, but it's to everyone's advantage that Google is able to identify which pages on the web deserve priority crawling, and that Google has retrieved the fresh info those pages contain and incorporated it into its index, when it comes to a topical type of query like "diffraction of light with quantum physics".

In the end, the results for every query depend on info gathered from the web, and user data helps refine the results. Info that is more static can be prioritized with more input from click-through data, while new information found on the web must rely more on Google's artificial intelligence to push it up in front of searchers.

Another reason that the "90%" statement sticks out to me is that there is a fairly often-used factoid tossed around by industry experts that "6% to 20% of queries that get asked every day have never been asked before." Google can't rely heavily on past query data for these types of searches.
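
The crawl cadence described above (important sites revisited constantly, less important ones weekly or monthly) can be sketched as a priority queue keyed on each site's next due time. This is a toy illustration under stated assumptions, not a description of how Google schedules crawls: the `CrawlScheduler` class, the site names, and the hour-based intervals are all hypothetical.

```python
import heapq


class CrawlScheduler:
    """Toy recrawl scheduler: each URL is revisited at its own fixed interval."""

    def __init__(self):
        # Min-heap of (next_due_time, url, interval_hours).
        self._queue = []

    def add(self, url, interval_hours, now=0.0):
        """Register a URL that is due immediately and then every `interval_hours`."""
        if interval_hours <= 0:
            raise ValueError("interval_hours must be positive")
        heapq.heappush(self._queue, (now, url, interval_hours))

    def next_due(self, now):
        """Return every URL due at or before `now`, rescheduling each one."""
        due = []
        while self._queue and self._queue[0][0] <= now:
            _, url, interval = heapq.heappop(self._queue)
            due.append(url)
            # The next visit is strictly after `now`, so this loop terminates.
            heapq.heappush(self._queue, (now + interval, url, interval))
        return due


scheduler = CrawlScheduler()
scheduler.add("news-site", interval_hours=1)      # assumed: fast-changing site
scheduler.add("static-bio", interval_hours=168)   # assumed: weekly is enough

first_pass = scheduler.next_due(now=0)
hour_one = scheduler.next_due(now=1)
week_later = scheduler.next_due(now=168)
```

A real crawler would presumably adapt the interval to how often a page is observed to change; the fixed intervals here only illustrate the split between fresh and static content.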

wmf about 10 years ago

So does Mozilla's contract with Yahoo allow Mozilla to track query data and maybe feed it to underdog search engines like DDG or Blekko (oops)?

ekr about 10 years ago

So that's why Google created the Chrome browser.

minthd about 10 years ago

So, since Google tracks the full browsing experience of Chrome users, and hence gets more relevant data than it does for other browsers' users, it has the theoretical ability to offer better search results to Chrome users.

Has anybody noticed this happening?

tokai about 10 years ago

Training data is nice, but I think it's important not to underestimate capacity for crawling. IMO one of Google's strengths is that they crawl large quantities of new content. Smaller operations like DDG can't crawl at that scale. If I want discussion of new bugs, want to search the articles at my favorite news page (where the in-house search is unusable), or just want the newest blog post on some subject, Google is hard to beat.

PaulHoule about 10 years ago

At this point Google is not winning because its search results are good (have you used Google recently?); it is winning because it makes almost 10x as much revenue per view as other search engines do. At that rate, any other search engine is running a charity.

countrybama24 about 10 years ago

Seems like there is a business opportunity to build a plugin of sorts that allows users to opt in and share their search data with competing platforms. I'd be interested in donating my data to help a rival engine compete with Google.

thallukrish about 10 years ago

Only when the user can own his data, which means apps are just logic and the user can selectively allow access to whomever and whatever he chooses, will we suddenly find more genuine things reaching the user, be it commerce or content.

thrownaway2424 about 10 years ago

It is unsettling to read this kind of chip-on-my-shoulder opinion piece full of innuendo under the Firefox logo and the Mozilla name but on the author's personal domain.

Semiapies about 10 years ago

TL;DR - Yahoo! still exists and resents Google. But not for being better in their niche, no. Just for delivering a better service, which is not at all the same thing. Somehow.

asuffield about 10 years ago

(Tedious disclaimer: my opinion, not my employer's. Not representing anybody else. I work at Google, not on search quality.)

This article makes a number of bold claims about the contents of data and code which its author hasn't seen, and is written by a company that is receiving a large amount of money from Yahoo. I would encourage people not to forget these details.
