TechEcho

The woes of building an index of the web

69 points by jennita over 9 years ago

3 comments

ChuckMcM over 9 years ago
This is a really great description of why building crawlers (and indexes) is a really hard problem. Basically 90% of the "web" is now crap, and by crap I mean stuff you would never ever want to visit as a real human being. Our crawler once found an entire set of subdomains with nothing but Markov chain generated "forum" pages, and of course SEO links for page rank love (note to SEO types, this hasn't fooled Google for at least 6 years).

The explosion of cheap CPU and storage means that a single server with a few terabytes of disk can serve up a billion or more spam pages. And seemingly everyone who gets into the game starts with "I know, we'll create a lot of web sites that link to this thing I'm trying to get to rank in Google results ..." Worse, when it doesn't work they don't bother taking that crap down, they just link to it from more and more other sites in an attempt to improve its host authority. That doesn't work either (for getting page rank).

But what it means is that 99.9% of all new web pages created on a given day are created by robots or algorithms or other agencies without any motive to provide value, merely to provide "inventory" for advertisements. You are lucky if you can pull a billion "real" web pages out of a crawl frontier of 100 billion URIs.
greglindahl over 9 years ago
Note that they're building a graph of the web for SEO purposes, not a search engine index.
sqldba over 9 years ago
I read the whole thing and did a search and still don't know what the index is, what it's for, or how to use it.