At what point is the internet too big for Google to index it?

11 points by blablablub, almost 3 years ago
Recently there has been a lot of talk about declining Google search results, missing forum links, etc. That got me thinking: at what point in time, or at what volume, is the internet too big for Google to index it?

12 comments

wilde, almost 3 years ago
The problem isn't that the internet is too big. It's that Google and the internet grew apart from each other.

Some old sites never upgraded to https or met the other technical demands Google made of them. Google chose to stop indexing these sites to force them to change their behavior.

Most new content is trapped in walled gardens of some format. The one I see all the time is Discord, but the communities you care about are probably talking in a non-indexable group chat rather than on an indexable Internet forum like they might have 20 years ago.
Havoc, almost 3 years ago
I don't think Google's recent troubles are a result of index size.

Pretty sure G could throw more money and resources at it if they thought that would make a dent.

It feels more like a lot of real content is collateral damage in the SEO vs Google wars. E.g. a blogger was complaining the other day that someone had set up an automation to scrape their content the second it gets published, run it through translate twice and publish the resulting semi-gibberish.

Those sorts of shenanigans are, I suspect, quite hard to deal with even if you're Google.
marginalia_nu, almost 3 years ago
Google already doesn't index the entire Internet. The internet even having a size becomes more questionable the more you think about it.

Let's say we set up a wildcard domain *.example.com all pointing to a server set up so that

    0.example.com/ has a link to 0.example.com/0 and 1.example.com/
    0.example.com/0 has a link to 0.example.com/1 and 0.example.com/0/0
    0.example.com/0/0 has a link to 0.example.com/0/1 and 0.example.com/0/0/0
    1.example.com/ has a link to 1.example.com/1 and 2.example.com/
    1.example.com/0 has a link to 1.example.com/1 and 1.example.com/0/0

and so forth.

This way even a Raspberry Pi is able to trivially host an infinite number of infinite websites.
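A minimal sketch of why this costs the server almost nothing: the links on each page can be computed from the URL alone, so no page ever has to be stored. The function name `outgoing_links` is hypothetical, and the sketch assumes every subdomain root follows the pattern given for 0.example.com/ (one link to its own /0 page, one to the next subdomain); it is not the commenter's actual setup.

```python
def outgoing_links(host: str, path: str) -> list[str]:
    """Return the two links for a page, derived purely from its URL.

    Assumption: hosts look like "<integer>.example.com", as in the comment.
    """
    sub, _, domain = host.partition(".")
    segments = [s for s in path.split("/") if s]

    if not segments:
        # Root page: one link into this site, one to the next subdomain.
        return [f"{host}/0", f"{int(sub) + 1}.{domain}/"]

    # Inner page: the sibling ending in "1", and a child one level deeper.
    parent = "/".join(segments[:-1])
    prefix = f"{host}/{parent}/" if parent else f"{host}/"
    return [f"{prefix}1", f"{host}/{'/'.join(segments)}/0"]


if __name__ == "__main__":
    for host, path in [("0.example.com", "/"), ("0.example.com", "/0/0")]:
        print(host + path, "->", outgoing_links(host, path))
```

A crawler following these links never runs out of new URLs, which is the sense in which the "size" of the internet stops being a well-defined number.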
bediger4000, almost 3 years ago
I thought Google was deliberately "forgetting" older material to make room for new. The "long tail" turned out to be hogwash, and advertising corrupted everything.
throwaway532532, almost 3 years ago
Google got too small for the internet around the time of their first Penguin + Panda updates. They killed off a lot of good sites in the war against SEO.

They have also reduced image search, and torrents and many other things have been removed via the censoring of the index and YouTube.

Pushing corporate-type sites up and other things into the nether, Google has become the new yellow pages, along with being an arbiter of higher-ranked health info and links to old answers for programmers.

Basic things like recipes are so bad that even John Oliver jokes about reading a dozen paragraphs before finding a recipe via Google.

So many things that are fun / entertaining / sexy and more have no room in the highbrow expectations of the big G.

Hence TikTok being more popular than Google now.

Some of the things they have removed have come from govs and industries with a lot of sway, but much of what they downrank post Panda/Penguin is a veiled attempt at being more politically correct and less blue collar.

Imo.

So indeed there is now room for other 'search/find' portals for things Google doesn't want to showcase on their front pages.

But for at least some while into the future they will likely be the best yellow pages, since customers do most of that work for them.
jstx1, almost 3 years ago
I really don't think the reduced quality in results is because the Internet is too big all of a sudden.
sytelus, almost 3 years ago
Infrastructure-wise, scaling will continue to be possible with CS innovations. Algo-wise, I am not sure if we can handle all the additional adversarial content and noise. A lot of index pruning happens just to reduce adversarial content and noise. However, ultimately it all comes down to cost in the long run. The cost of crawling and serving an extra Y% needs to be equal to or lower than the potential drop in revenue in the long run. At the current stage, it is likely that the vast majority of the crawlable internet is not actually in the index. By some measures, just 50B pages were sufficient to keep most users fairly happy. Going to 150B pages has a marginal gain that small players cannot afford. The reachable size of the internet is well over 1T pages.
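To put those figures side by side, here is a rough back-of-the-envelope sketch. The page counts come from the comment; the names `coverage` and `worth_expanding` are hypothetical, and since the reachable web is "well over 1T pages", the computed shares are upper bounds rather than measurements.

```python
REACHABLE_PAGES = 1_000_000_000_000  # "well over 1T", so treated as a lower bound


def coverage(index_pages: int, reachable: int = REACHABLE_PAGES) -> float:
    """Share of the reachable web that an index of the given size covers."""
    return index_pages / reachable


def worth_expanding(extra_cost: float, revenue_at_risk: float) -> bool:
    """The comment's rule: grow the index only if the cost of crawling and
    serving the extra pages is no higher than the revenue it protects."""
    return extra_cost <= revenue_at_risk


if __name__ == "__main__":
    for size in (50_000_000_000, 150_000_000_000):
        share = coverage(size)
        print(f"{size // 10**9}B pages: at most {share:.0%} of the reachable web")
```

Even the larger index would cover at most about 15% of the reachable pages, which fits the claim that most of the crawlable internet is not in the index.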
sacrosanct, almost 3 years ago
Google can cope with a Zettabyte Era (https://en.m.wikipedia.org/wiki/Zettabyte_Era); it's separating the wheat from the chaff that is the hard problem. Also, most data is being siloed behind walled gardens and can't be indexed.
cpach, almost 3 years ago
I believe we have already reached that point.
dekhn, almost 3 years ago
Google blackholes many sites and they don't get indexed.
dontbenebby, almost 3 years ago
Too walled is the issue, not too big. Much is behind things like Facebook etc.
betaby, almost 3 years ago
It feels like the Internet is trivially small, at least the text-based part.