Recently there has been a lot of talk about declining Google search results, missing forum links, etc.
That got me thinking: at what point in time, or at what volume, does the internet become too big for Google to index?
The problem isn't that the internet is too big. It's that Google and the internet grew apart from each other.

Some old sites never upgraded to HTTPS or met other technical demands Google made of them. Google chose to stop indexing these sites to force them to change their behavior.

Most new content is trapped in walled gardens of one form or another. The one I see all the time is Discord, but the communities you care about are probably talking in a non-indexable group chat rather than on an indexable internet forum like they might have 20 years ago.
I don't think Google's recent troubles are a result of index size.

Pretty sure G could throw more money and resources at it if they thought that would make a dent.

It feels more like a lot of real content is collateral damage in the SEO-vs-Google wars. For example, a blogger was complaining the other day that someone had set up an automation to scrape their content the second it gets published, run it through machine translation twice, and publish the resulting semi-gibberish.

That sort of shenanigans is, I suspect, quite hard to deal with even if you're Google.
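To give a feel for why the double-translated copies are awkward to filter, here is an illustrative sketch (not Google's actual pipeline, just a standard near-duplicate trick) of word-shingle Jaccard similarity; all of the example texts in it are made up.

    # Illustrative sketch only, not Google's pipeline: word-shingle Jaccard
    # similarity, a common trick for spotting near-duplicate pages.
    # All of the example texts below are made up.
    def shingles(text, k=3):
        """Set of k-word shingles ('k-grams') in a text."""
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

    def jaccard(a, b):
        """Overlap of two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
        return len(a & b) / len(a | b) if a or b else 0.0

    original = ("we benchmarked the parser on a million documents and found that "
                "startup cost dominates unless the input is batched")
    # The kind of copy the blogger describes: scraped, machine-translated twice,
    # republished. The meaning survives but the wording drifts.
    scraped = ("we benchmarked the parser on one million documents and found "
               "the startup cost dominates if the input is not grouped")

    print(jaccard(shingles(original), shingles(original)))  # 1.0: an exact copy is trivial to catch
    print(jaccard(shingles(original), shingles(scraped)))   # partial overlap: suspicious, but not conclusive

A real crawler-side pipeline would use something like MinHash or SimHash to run this at web scale, but the tension is the same: set the threshold low enough to catch mangled copies and you also start flagging legitimate quotation and syndication.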
Google already doesn't index the entire Internet. The internet even having a size becomes more questionable the more you think about it.

Let's say we set up a wildcard domain *.example.com, all pointing to a server set up so that

    0.example.com/    has a link to 0.example.com/0   and 1.example.com/
    0.example.com/0   has a link to 0.example.com/1   and 0.example.com/0/0
    0.example.com/0/0 has a link to 0.example.com/0/1 and 0.example.com/0/0/0
    1.example.com/    has a link to 1.example.com/0   and 2.example.com/
    1.example.com/0   has a link to 1.example.com/1   and 1.example.com/0/0

and so forth.

This way even a Raspberry Pi can trivially host an infinite number of infinite websites.
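To make the setup concrete, here is a minimal sketch of such a server in Python, under the assumption that a wildcard DNS record for *.example.com points at it; the domain, port, and link scheme are illustrative only. Every page is generated the moment a crawler asks for it, so there is nothing finite on disk to index.

    # Minimal sketch of the trap above: one process behind a wildcard DNS record
    # answers every request with links to two pages that don't exist until asked for.
    # The domain, port, and link scheme are illustrative assumptions only.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DOMAIN = "example.com"  # assumed wildcard domain pointing at this host

    class TrapHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Which subdomain was requested, e.g. "3" for 3.example.com
            host = self.headers.get("Host", f"0.{DOMAIN}").split(":")[0]
            sub = host.split(".")[0]
            next_sub = str(int(sub) + 1) if sub.isdigit() else "0"

            path = self.path.split("?")[0].rstrip("/")
            # Link "down" by appending /0, and "along" by bumping the last path
            # segment (or, from a subdomain root, moving to the next subdomain).
            down = f"http://{sub}.{DOMAIN}{path}/0"
            if path:
                head, _, last = path.rpartition("/")
                bumped = str(int(last) + 1) if last.isdigit() else "0"
                along = f"http://{sub}.{DOMAIN}{head}/{bumped}"
            else:
                along = f"http://{next_sub}.{DOMAIN}/"

            body = f'<a href="{down}">down</a> <a href="{along}">along</a>'
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body.encode())

    if __name__ == "__main__":
        # Any request to any *.example.com host that resolves here gets a page.
        HTTPServer(("", 8080), TrapHandler).serve_forever()

A handful of lines plus wildcard DNS is all it takes; a crawler that follows links blindly never finishes, which is part of why search engines budget crawl per site rather than trying to exhaust it.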
I thought Google was deliberately "forgetting" older material to make room for the new. The "long tail" turned out to be hogwash, and advertising corrupted everything.
Google got too small for the internet around the time of their first Penguin + Panda updates.
They killed off a lot of good sites in the war against SEO.

They have also reduced image search; torrents and many other things have been removed via censoring of the index and YouTube.

Pushing corporate-type sites up and other things into the nether, Google has become the new yellow pages, along with being an arbiter of higher-ranked health info and links to old answers for programmers.

Basic things like recipes are so bad that even John Oliver jokes about reading a dozen paragraphs before finding a recipe via Google.

So many things that are fun / entertaining / sexy and more have no room under the high-brow expectations of the big G.

Hence TikTok being more popular than Google now.

Some of the removals have come from governments and industries with a lot of sway, but much of what they downrank post Panda/Penguin is a veiled attempt at being more politically correct and less blue-collar.

Imo.

So indeed there is now room for other 'search/find' portals for things Google doesn't want to showcase on their front pages.

But for at least a while into the future they will likely be the best yellow pages, since customers do most of that work for them.
Infrastructure-wise, scaling will continue to be possible with CS innovations. Algorithm-wise, I am not sure we can handle all the additional adversarial content and noise; a lot of index pruning happens just to reduce them. Ultimately, though, it all comes down to cost in the long run: the cost of crawling and serving an extra Y% needs to be equal to or lower than the potential drop in revenue from not having it. At the current stage, it is likely that the vast majority of the crawlable internet is not actually in the index. By some measures, just 50B pages were sufficient to keep most users fairly happy. Going to 150B pages yields only a marginal gain, at a cost that small players cannot afford. The reachable size of the internet is well over 1T pages.
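To make that cost argument concrete, here is a back-of-envelope sketch; every number in it is an invented placeholder (nothing here is a real Google figure), the point is only the shape of the comparison.

    # Back-of-envelope sketch of the marginal-cost argument above.
    # All figures are invented placeholders, not real numbers from anyone.
    def worth_expanding(extra_pages, cost_per_page_year, extra_queries, revenue_per_query):
        """Expand the index only if the marginal revenue covers the marginal cost."""
        marginal_cost = extra_pages * cost_per_page_year       # crawl + store + serve
        marginal_revenue = extra_queries * revenue_per_query   # queries only those extra pages can satisfy
        return marginal_revenue >= marginal_cost

    # Hypothetical: growing from 50B to 150B pages means 100B extra pages at
    # $0.001/page/year, improving results for 2B queries/year worth $0.03 each.
    print(worth_expanding(100e9, 0.001, 2e9, 0.03))  # False: $60M of revenue vs $100M of cost

On numbers like these the extra 100B pages never pays for itself, which is the point about small players: the threshold where indexing more stops being worth it arrives much sooner when your revenue per query is lower.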
Google can cope with the Zettabyte Era (https://en.m.wikipedia.org/wiki/Zettabyte_Era); it's separating the wheat from the chaff that is the hard problem. Also, most data is now largely siloed behind walled gardens and can't be indexed.