You have to consider that all that data isn't just a list of web pages. It includes the information each page exposes and the contents of the page itself (as well as the associated metadata). A few things you could do with it:

1) Build a map of the web (what links where, and how), enabling a much richer browsing experience: no more combing through a site hoping to find links to other interesting places. (Rough sketch below.)

2) The pages contain a large volume of technical documentation, plus a large number of opinions about the technologies that documentation describes. With both, and a long enough history of when pages were created, when the documentation was released, and when the opinions appeared, you could build a model that predicts how likely any particular technology is to succeed. (Toy sketch below.)

3) Given enough time, there is a large enough body of text that could be rendered to audio through speech synthesis (the higher the quality, the better) to train a speech recognition system to a previously unseen level of accuracy.

4) Given the right algorithm, sites with security vulnerabilities could be identified purely from what is crawlable, and those are sites you most likely shouldn't visit. From that list, and the total number of unique websites in the database, you can compute the ratio of harmful to (probably) non-harmful sites and attach a risk estimate to any given link on any given page. (Rough sketch below.)

5) Build a search engine that runs regular expressions over different slices of the data (metadata, tags, text, text within a specific tag, etc.) for a previously unseen level of precision. Result quality would then depend on the person doing the searching, not on the system's ability to guess that by "std::list" I didn't mean "Sexually Transmitted Diseases STD List" (seriously, http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=std::list has that as the second result). (Rough sketch below.)
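For point 1, here's a minimal sketch of what I mean by a link map, assuming the crawl is available as (url, html) pairs; the names and structure are made up for illustration, not taken from any real crawl format.

    # Build a "what links where" map from crawled pages (stdlib only).
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from collections import defaultdict

    class LinkExtractor(HTMLParser):
        """Collects href values from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def build_link_map(pages):
        """pages: iterable of (url, html) -> dict of url -> set of outbound urls."""
        graph = defaultdict(set)
        for url, html in pages:
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                graph[url].add(urljoin(url, href))  # resolve relative links
        return graph

    # Two toy pages linking outward.
    pages = [
        ("http://a.example/", '<a href="http://b.example/">b</a>'),
        ("http://b.example/", '<a href="/about">about</a>'),
    ]
    print(dict(build_link_map(pages)))

Once you have that graph, the "enhanced browsing" part is just walking it: related pages, backlinks, shortest paths between two sites, and so on.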
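For point 2, the actual model is the hard part and I won't pretend to sketch it; the only piece that's easy to show is the raw trend signal, assuming each crawled page comes tagged with a year. Everything below is a toy.

    # Count mentions of a technology per year as a crude trend signal.
    from collections import Counter
    import re

    def mention_trend(pages, term):
        """pages: iterable of (year, text). Returns Counter of year -> mention count."""
        rx = re.compile(re.escape(term), re.IGNORECASE)
        counts = Counter()
        for year, text in pages:
            counts[year] += len(rx.findall(text))
        return counts

    pages = [
        (2009, "Node.js was announced today."),
        (2010, "Why I moved my API to Node.js. Node.js is fast."),
    ]
    print(mention_trend(pages, "node.js"))  # Counter({2010: 2, 2009: 1})

Separating documentation from opinion, and opinion from sentiment, is where the real work would be.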
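For point 4, assuming some scanner has already produced a set of flagged domains (how that flagging works is the hard part and is deliberately left out), the ratio and the per-link risk estimate are just bookkeeping:

    # Harmful-to-total ratio plus a crude per-link risk estimate.
    from urllib.parse import urlparse

    def base_rate(flagged_domains, all_domains):
        """Fraction of unique crawled domains that were flagged."""
        return len(flagged_domains) / len(all_domains) if all_domains else 0.0

    def link_risk(url, flagged_domains, all_domains):
        """1.0 if the link's domain is flagged, otherwise fall back to the base rate."""
        domain = urlparse(url).netloc
        return 1.0 if domain in flagged_domains else base_rate(flagged_domains, all_domains)

    all_domains = {"a.example", "b.example", "c.example", "d.example"}
    flagged = {"c.example"}
    print(base_rate(flagged, all_domains))                            # 0.25
    print(link_risk("http://c.example/login", flagged, all_domains))  # 1.0

A real system would weight the estimate by more than domain membership, but that's the shape of it.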
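For point 5, a toy version of field-scoped regex search, with each document represented as a plain dict of field name -> text (a real index obviously wouldn't scan linearly):

    # Regex search restricted to one slice of each document.
    import re

    def search(docs, pattern, field="text"):
        """Yield (url, matched text) for docs whose chosen field matches the regex."""
        rx = re.compile(pattern)
        for doc in docs:
            m = rx.search(doc.get(field, ""))
            if m:
                yield doc["url"], m.group(0)

    docs = [
        {"url": "http://x.example/lists", "title": "std::list",
         "text": "std::list is a doubly linked list."},
        {"url": "http://y.example/health", "title": "STD list",
         "text": "A list of sexually transmitted diseases."},
    ]
    # Matching the literal "std::list" keeps the second page out of the results.
    print(list(search(docs, r"std::list", field="text")))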