
Ask HN: Access to the Corpus - What Would You Do?

12 points by BrandonWatson, almost 16 years ago
What if you had programmatic access to the entirety of the dataset of Google, Bing or Yahoo: every page they have crawled (including all metadata), all the searches performed and the ads for display? If you had programmatic access to that dataset, what business would you create?

One of my friends posed this question to me on Friday and my brain seized up. Creating another search engine made no sense, but the massive size of the dataset and the range of possibilities actually made my brain shut down.

What would HN folks do?

13 comments

caffeine, almost 16 years ago
So... Google *have* these data. Their hacking chops are not unremarkable. What have they done with it? Well, basically, you can ask for a word and they'll find other pages that feature that word...

And that's it.

They have access to "the world's combined knowledge" and a zillion PhDs, and that's *all* they can do?! It's shocking. But it's not, really, because the data are basically useless without annotations.

So let's go to Disneyland and pretend we have a genius NLP engine or an annotated web. Then,

1) 360 on a company/product. In particular: who are *all* the stakeholders, and how do they feel? I'd sell this to e.g. analysts. Same thing on people's online identities.

2) Memetracing. I'd sell this to advertisers. (So, we follow historical product releases and see exactly what memes spread about them and how, and through whom. Related to (3) on ismarc's post.)

3) Rumors (this would require your feed to be real-time). I'd also probably sell this to stock traders (the idea here is to monitor e.g. forums frequented by GE employees to guess at scoops).

OK, so those are basic. More interesting:

4) Organization-tracing. If you can label social graph edges with influence levels / information intakes, you can start playing with predicting organizational decision-making (behavioral economics / game theory. There's a TED talk about this).

5) Games: procedural content generation that looks really real, i.e. worlds full of people whose identities are plausible, whose interactions with others are plausible, etc.

Those all require some analytical / NLP firepower on a scale which I don't think is really doable at the moment. The problem is that bags of words are meaningless without a social context: the data are pretty worthless unless your computer can figure out who it's important for and why.
tel, almost 16 years ago
SEO? With something like that you could programmatically find semantic locations with lower coverage and then sell the knowledge that you could possibly attack that keyword. For instance, I'd love to see a chart which plotted frequency of use in an English corpus against some hypothetical Google-coverage variable. Anything that's a strong outlier from something like an exponential curve could be a good target.
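
A minimal sketch of the outlier hunt described above, under stated assumptions: each term comes with a corpus frequency and a hypothetical "coverage" count (neither dataset exists in the thread, and the sample numbers are invented for illustration). It fits a straight line in log-log space, i.e. a power law standing in for the "exponential curve" mentioned, and flags terms whose coverage falls well below the fit. The names `fit_line`, `undercovered_terms` and `z_threshold` are made up for this example.

```python
import math
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def undercovered_terms(stats, z_threshold=1.5):
    """Return terms whose coverage is far below the fitted trend.

    stats maps term -> (corpus_frequency, coverage); both must be positive.
    The coverage figure is a hypothetical variable, as in the comment.
    """
    terms = list(stats)
    xs = [math.log(stats[t][0]) for t in terms]
    ys = [math.log(stats[t][1]) for t in terms]
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return []  # every term sits exactly on the trend line
    flagged = [(r, t) for r, t in zip(residuals, terms)
               if r < -z_threshold * spread]
    # Most under-covered term first.
    return [t for _, t in sorted(flagged)]


# Invented sample data: coverage tracks frequency except for one term.
sample = {
    "alpha": (10, 50),
    "beta": (100, 500),
    "gamma": (1000, 5000),
    "delta": (10000, 50000),
    "epsilon": (100000, 500000),
    "rare-topic": (1000, 5),
}
print(undercovered_terms(sample))
```

The outlier itself is included in the fit, so with very few terms it drags the trend line toward itself; a more careful version might use a robust fit instead.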
ismarc, almost 16 years ago
You have to consider that all that data isn't just a list of web pages. That includes information provided by the web page and the contents of the page itself (as well as associated metadata).

1) Create a map of the web (what links where and how), enabling an enhanced "browsing" experience (no more perusing a site for links to other interesting places).

2) The contents of those pages contain a large volume of technical documentation as well as a large number of opinions of the technologies that rose from the documentation. With both, and a large enough history of the creation of pages, the release of the documentation and the opinions presented, a model can be built to predict the success rate of any particular technology.

3) Given sufficient time, there is a large enough set of text that can be rendered to audio through speech synthesis (higher quality the better) in order to train a speech recognition system to a previously unseen level of accuracy.

4) Given a proper algorithm, sites with security vulnerabilities can be discovered just from what is crawlable and most likely should not be accessed. From that list, and given the total number of unique websites contained in the database, you can calculate the ratio of harmful to potentially non-harmful websites and provide a risk threshold of any given link on any given page.

5) Provide a search engine that uses regular expressions on different sets of the data (metadata, tags, text, text in a specific tag, etc.) to present a search engine with a previously unseen level of accuracy (accuracy of the results is dependent on the individual doing the searching, not the system's ability to guess that by std::list, I didn't mean "Sexually Transmitted Diseases STD List" (seriously, http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=std::list has that as the second result)).

Edit: I failed at formatting
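
Idea 5 above can be sketched in a few lines. This is only an illustration, assuming each crawled page is a plain dict with hypothetical `url`, `title` and `text` fields; the function name `regex_search` and the sample pages are invented, and a real index would not scan every page linearly.

```python
import re


def regex_search(pages, pattern, field="text"):
    """Return URLs of pages whose chosen field matches the regex.

    pages is a list of dicts; the field names are assumptions made
    for this sketch, not a real crawler schema.
    """
    compiled = re.compile(pattern)  # case-sensitive on purpose
    return [page["url"] for page in pages
            if compiled.search(page.get(field, ""))]


# Invented sample pages.
pages = [
    {"url": "http://cpp.example/list",
     "title": "std::list reference",
     "text": "std::list is a doubly linked container."},
    {"url": "http://health.example/std",
     "title": "STD list",
     "text": "A list of sexually transmitted diseases (STD list)."},
]

print(regex_search(pages, r"\bstd::list\b"))
print(regex_search(pages, r"(?i)^std list$", field="title"))
```

Because the match is exact and case-sensitive unless the searcher says otherwise, the quality of the results depends on the pattern, which is the trade-off the comment describes.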
notaddicted, almost 16 years ago
I'd like to see which search results were clicked on for a given search, and then associate those sites. And then do an Amazon-style "people who landed at this site also went to _____", as well as other relations based on user browsing rather than on site links.
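
A rough sketch of that association, assuming the click data is available as (query, clicked site) pairs. That log format, the function names `build_co_clicks` and `also_visited`, and the sample entries are all invented for illustration. Two sites count as associated each time they were both clicked for the same query.

```python
from collections import Counter, defaultdict


def build_co_clicks(click_log):
    """Count how often two sites were clicked for the same query.

    click_log is an iterable of (query, site) pairs; this format is
    an assumption made for the sketch.
    """
    sites_by_query = defaultdict(set)
    for query, site in click_log:
        sites_by_query[query].add(site)

    co_clicks = defaultdict(Counter)
    for sites in sites_by_query.values():
        for a in sites:
            for b in sites:
                if a != b:
                    co_clicks[a][b] += 1
    return co_clicks


def also_visited(co_clicks, site, top_n=3):
    """Sites most often clicked alongside the given site."""
    return [other for other, _ in co_clicks[site].most_common(top_n)]


# Invented sample log.
click_log = [
    ("python list", "a.example"),
    ("python list", "b.example"),
    ("linked list", "a.example"),
    ("linked list", "b.example"),
    ("linked list", "c.example"),
]

co_clicks = build_co_clicks(click_log)
print(also_visited(co_clicks, "a.example", top_n=1))
```

Counting per query rather than per user session is a simplification; with session identifiers the same structure could key on sessions instead.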
bowman, almost 16 years ago
I would use it to fight crime/corruption. It isn't too hard to identify people from a few Google searches. Examples:

Food places near X will give you a good indication of where they live and their income.
Find what they are searching for during normal work hours. Often this will tell you where they work.
Searches for themselves.

Then just use this information to hunt them down or expose them.
jhancock, almost 16 years ago
Create new tools, much as Google, Yahoo and MS already do when they leverage this asset. I know this is somewhat of a BS answer, but really what other answer is there? This is like asking the question "What new features should Google, Yahoo or MS build on top of their search?"
andyleclair, almost 16 years ago
Download all of the porn on the internet. I'm only half joking.
yannis, almost 16 years ago
I would keep the metadata and reverse engineer the algo!
snitko, almost 16 years ago
Sell it to someone who'd know what to do with it?
Diakronik, almost 16 years ago
I wouldn't create a business. I'd write a program that could learn (text-based) language. Then I'd write it up and submit it to Computational Linguistics. Then I'd die in obscurity as someone took my idea and figured out how to monetize it.
derefr, almost 16 years ago
Start the world's most underhanded (and successful) SEO firm.
trevelyan, almost 16 years ago
Better machine translation.
CamperBob, almost 16 years ago
This scenario is reminiscent of what happened a few years back when AOL released their whole corpus of search queries with associated user-ID hashes. Most of what was done with that data set amounted to amateur CSI work and general meanness. I don't recall any profound revelations or can't-miss business opportunities coming out of it. The value of raw data is overrated.