all hail their paper Focused Crawling [1][2] and its rundown of their hardware infrastructure! what a massive system ;). who else was actively talking about their infrastructure at the time (1999)? very few.

> The world-wide web, having over 350 million pages, continues to grow rapidly at a million pages per day. About 600 GB of text changes every month. Such growth and flux poses basic limits of scale for today's generic crawlers and search engines. At the time of writing, Alta Vista's crawler, called the Scooter, runs on a 1.5 GB memory, 30 GB RAID disk, 533 MHz AlphaServer 4100-5/300 with 1 GB/s I/O bandwidth. Scooter connects to the indexing engine Vista, which is a 2 GB memory, 180 GB RAID disk, 2 533 MHz AlphaServer 4100-5/300. (The query engine is even more impressive, but is not relevant to our discussion.) Other giant web crawlers use similar fire-power, although in somewhat different forms, e.g., Inktomi uses a cluster of hundreds of Sun Sparc workstations with 75 GB of RAM and over 1 TB of spinning disk, and it crawls over 10 million pages a day.

[1] http://www.public.asu.edu/~hdavulcu/Focused%20Crawling.htm

[2] https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.1111&rank=4&q=focused%20crawling&osm=&ossid=
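the fun part is how small the core idea is next to that hardware: instead of crawling breadth-first, a focused crawler scores each fetched page for topical relevance and only expands links from pages that pass, keeping the frontier as a priority queue. here's a minimal sketch of that loop, assuming a caller-supplied `relevance(url, text)` scorer standing in for the paper's hypertext classifier (the distiller and all the real engineering are omitted):

```python
import heapq
import itertools
from urllib.parse import urljoin
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def focused_crawl(seeds, relevance, max_pages=100, threshold=0.5):
    """Best-first crawl: expand links only from pages scored relevant.

    `relevance(url, text) -> float in [0, 1]` is an assumed stand-in for
    the paper's topic classifier; any scorer with that shape works.
    """
    counter = itertools.count()  # tie-breaker so the heap never compares URLs
    frontier = [(-1.0, next(counter), url) for url in seeds]
    heapq.heapify(frontier)
    seen, results = set(seeds), []

    while frontier and len(results) < max_pages:
        neg_score, _, url = heapq.heappop(frontier)
        try:
            text = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue
        score = relevance(url, text)
        if score < threshold:
            continue  # prune: off-topic pages are not expanded
        results.append((url, score))
        parser = LinkExtractor()
        parser.feed(text)
        for href in parser.links:
            child = urljoin(url, href)
            if child.startswith("http") and child not in seen:
                seen.add(child)
                # children inherit the parent's score as their crawl priority
                heapq.heappush(frontier, (-score, next(counter), child))
    return results
```

the pruning is the whole trick: the frontier stays small and on-topic, which is why the paper could argue you don't need AltaVista-class fire-power to cover a single topic well.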
And who here remembers astalavista.box.sk?

https://en.m.wikipedia.org/wiki/Astalavista.box.sk