Ask HN: Access to the Corpus - What Would You Do?

12 points by BrandonWatson almost 16 years ago

What if you had programmatic access to the entirety of the dataset of Google, Bing or Yahoo; every page they have crawled (including all meta data), all the searches performed and ads for display. If you had programmatic access to that dataset, what business would you create?

One of my friends posed this question to me on Friday and my brain seized up. Creating another search engine made no sense, but the massive size of the data set and potential possibilities actually made my brain shut down.

What would HN folks do?

13 comments

caffeine almost 16 years ago

So .. Google have these data. Their hacking chops are not unremarkable. What have they done with it? Well, basically, you can ask for a word and they'll find other pages that feature that word ...

And that's it.

They have access to "the world's combined knowledge" and a zillion PhDs, and that's all they can do?! It's shocking. But it's not, really, because the data are basically useless without annotations.

So let's go to Disneyland and pretend we have a genius NLP engine or an annotated web. Then,

1) 360 on a company/product. In particular: who are all the stakeholders, and how do they feel? I'd sell this to e.g. analysts. Same thing on people's online identities.

2) Memetracing. I'd sell this to advertisers. (So, we follow historical product releases and see exactly what memes spread about them and how, and through who. Related to (3) on ismarc's post)

3) Rumors (this would require your feed to be real-time). I'd also probably sell this to stock traders (the idea here is to monitor e.g. forums frequented by GE employees to guess scoop)

OK, so those are basic. More interesting:

4) Organization-tracing. If you can label social graph edges with influence levels / information intakes, you can start playing with predicting organizational decision-making (behavioral economics / game theory. There's a TED talk about this).

5) Games: procedural content generation that looks really real, i.e. worlds full of people whose identities are plausible, whose interactions with others are plausible, etc.

Those all require some analytical / NLP firepower on a scale which I don't think is really doable at the moment. The problem is that bags of words are meaningless without a social context - the data are pretty worthless unless your computer can figure out who it's important for and why.

tel almost 16 years ago

SEO? With something like that you could programmatically find semantic locations with lower coverage and then sell the knowledge that you could possibly attack that keyword. For instance, I'd love to see a chart which plotted frequency of use in an English corpus against some hypothetical Google-coverage variable. Anything that's a strong outlier from something like an exponential curve could be a good target.
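
A rough sketch of that outlier hunt, under stated assumptions: the `stats` mapping, the coverage numbers and the `z_cutoff` parameter are all made up for illustration, and a straight-line fit in log-log space stands in for "something like an exponential curve". Terms whose coverage sits far below the fitted trend come back as candidate targets.

```python
import math
from statistics import mean, pstdev


def undercovered_terms(stats, z_cutoff=1.5):
    """Return terms whose coverage falls well below the fitted trend.

    stats maps term -> (corpus_frequency, coverage); both must be positive.
    Both names are hypothetical stand-ins for whatever the dataset exposes.
    """
    terms = list(stats)
    xs = [math.log(stats[t][0]) for t in terms]
    ys = [math.log(stats[t][1]) for t in terms]

    # Ordinary least-squares line through the log-log points.
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return []  # every term has the same frequency, so no trend to fit
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar

    # A negative residual means less coverage than the trend predicts.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []
    flagged = [t for t, r in zip(terms, residuals) if r / spread < -z_cutoff]
    # Most frequently used terms first: those are the bigger opportunities.
    return sorted(flagged, key=lambda t: stats[t][0], reverse=True)
```

The cutoff is arbitrary; with real data you would probably eyeball the chart first and only then pick a threshold.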

ismarc almost 16 years ago

You have to consider that all that data isn't just a list of web pages. That includes information provided by the web page and the contents of the page itself (as well as associated metadata).

1) Create a map of the web (what links where and how), enabling an enhanced "browsing" experience (no more perusing a site for links to other interesting places)

2) The contents of those pages contain a large volume of technical documentation as well as a large number of opinions of the technologies that rose from the documentation. With both, and a large enough history of the creation of pages, the release of the documentation and the opinions presented, a model can be built to predict the success rate of any particular technology.

3) Given sufficient time, there is a large enough set of text that can be rendered to audio through speech synthesis (higher quality the better) in order to train a speech recognition system to a previously unseen level of accuracy.

4) Given a proper algorithm, sites with security vulnerabilities can be discovered just from what is crawlable and most likely should not be accessed. From that list, and given the total number of unique websites contained in the database, you can calculate the ratio of harmful to potentially non-harmful websites and provide a risk threshold of any given link on any given page.

5) Provide a search engine that uses regular expressions on different sets of the data (metadata, tags, text, text in a specific tag, etc.) to present a search engine with a previously unseen level of accuracy (accuracy of the results is dependent on the individual doing the searching, not the system's ability to guess that by std::list, I didn't mean "Sexually Transmitted Diseases STD List"; seriously, http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=std::list has that as the second result).

Edit: I failed at formatting
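
Point 5 is the easiest one to sketch. What follows is only an illustration: the page records, their field names (`url`, `title`, `text`) and the example.com URLs are invented, and a linear scan like this would obviously not scale to a real crawl. It just shows the idea of letting the searcher pick the field and the exact pattern.

```python
import re


def regex_search(pages, pattern, field="text"):
    """Return URLs of pages whose chosen field matches the regular expression.

    pages is a list of dicts; the field names are hypothetical.
    """
    compiled = re.compile(pattern)
    return [
        page["url"]
        for page in pages
        if compiled.search(page.get(field, ""))
    ]


# Made-up records standing in for crawled pages.
pages = [
    {"url": "http://example.com/cpp", "title": "C++ containers",
     "text": "std::list<int> is a doubly linked list."},
    {"url": "http://example.com/health", "title": "STD list",
     "text": "A list of sexually transmitted diseases."},
]

# Case-sensitive and literal about the "::", so only the C++ page matches.
cpp_hits = regex_search(pages, r"\bstd::list\b")
```

Here `cpp_hits` contains only the C++ page, because the pattern is matched exactly as the searcher wrote it instead of being reinterpreted by the engine.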

notaddicted almost 16 years ago

I'd like to see which search results were clicked on for a given search, and then associate those sites. And then do an Amazon-style "people who landed at this site also went to _____", as well as other relations based on user browsing, not based on site links.
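
A minimal sketch of that association, assuming the click log can be reduced to `(query, clicked_site)` pairs. That log shape, the function name and the sample data are all hypothetical. Two sites count as related once for every query that led to clicks on both.

```python
from collections import Counter, defaultdict


def also_visited(click_log, site, top_n=3):
    """Rank sites that were clicked for the same queries as `site`.

    click_log is an iterable of (query, clicked_site) pairs.
    """
    # Collect the set of sites clicked for each query.
    sites_by_query = defaultdict(set)
    for query, clicked_site in click_log:
        sites_by_query[query].add(clicked_site)

    # Count how many queries each other site shares with `site`.
    related = Counter()
    for sites in sites_by_query.values():
        if site in sites:
            related.update(sites - {site})
    return [other for other, _ in related.most_common(top_n)]


# Made-up click log for illustration.
click_log = [
    ("python list", "a.example"), ("python list", "b.example"),
    ("list sort", "a.example"), ("list sort", "b.example"),
    ("list sort", "c.example"), ("cats", "d.example"),
]
suggestions = also_visited(click_log, "a.example")
```

Grouping by user session instead of by query would be closer to "relations based on user browsing", but it needs per-user data that this toy log does not have.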

bowman almost 16 years ago

I would use it to fight crime/corruption. It isn't too hard to identify people from a few Google searches. Examples:

Food places near X will give you a good indication of where they live and their income.

Find what they are searching during normal work hours. Often this will give you where they work.

Searches of themselves.

Then just use this information to hunt them down or expose them.

jhancock almost 16 years ago

Create new tools similar to how Google, Yahoo and MS already do when leveraging this asset. I know this is somewhat a BS answer, but really what other answer is there? This is like asking the question "What new features should Google, Yahoo or MS build on top of their search?"

andyleclair almost 16 years ago

Download all of the porn on the internet. I'm only half joking.

yannis almost 16 years ago

I would keep the meta data and reverse engineer the algo!

snitko almost 16 years ago

Sell it to someone who'd know what to do with it?

Diakronik almost 16 years ago

I wouldn't create a business. I'd write a program that could learn (text-based) language. Then I'd write it up and submit it to Computational Linguistics. Then I'd die in obscurity as someone took my idea and figured out how to monetize it.

derefr almost 16 years ago

Start the world's most underhanded (and successful) SEO firm.

trevelyan almost 16 years ago

Better machine translation.

CamperBob almost 16 years ago

This scenario is reminiscent of what happened a few years back when AOL released their whole corpus of search queries with associated user-ID hashes. Most of what was done with that data set amounted to amateur CSI work and general meanness. I don't recall any profound revelations or can't-miss business opportunities coming out of it. The value of raw data is overrated.