10% of the top million sites are dead

375 points by Soupy, almost 3 years ago

33 comments

gojomo, almost 3 years ago

Many issues with this analysis, some of which others have already mentioned, including:

• The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP.
• In many cases any responding web server will be on the `www.` subdomain, rather than the domain that was listed/probed, and not everyone sets up `www.` to respond/redirect. (Author misinterprets appearances of `www.domain` and `domain` in his source list as errant duplicates, when in fact that may be an indicator that those `www.domain` entries also have significant `subdomain.www.domain` extensions, depending on what Majestic means by 'subnets'.)
• Many sites may block `curl` requests because they only want attended human browser traffic, and such blocking (while usually accompanied by some error response) can be a more aggressive drop-connection.
• `curl` given a naked hostname likely attempts a plain HTTP connection, and given that even browsers now auto-prefix `https:` for a naked hostname, some active sites likely have nothing listening on the plain-HTTP port anymore.
• Author's burst of activity could've triggered other rate-limits/failures, either at shared hosts/inbound proxies servicing many of the target domains, or at local ISP egresses or DNS services. He'd need to drill down into individual failures to get a better idea of the extent to which this might be happening.

If you want to probe if domains are still active:

• confirm they're still registered via a `whois`-like lookup
• examine their DNS records for evidence of current services
• ping them, or any DNS-evident subdomains
• if there are any MX records, check if the related SMTP server will confirm any likely email addresses (like postmaster@) as deliverable. (But: don't send an actual email message.)
• (more at risk of being perceived as aggressive) scan any extant domains (from DNS) for open ports running any popular (not just HTTP) services

If you want to probe if web sites are still active, start with an actual list of web site URLs that were known to have been active at some point.
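
The `www.`, plain-HTTP, and blocked-client points above could be folded into a probe along these lines. This is only a minimal sketch, not the article author's method: the function names, the User-Agent string, and the 10-second timeout are all assumptions chosen for illustration, and it treats any HTTP response (even an error page) as a sign of life.

```python
import http.client
import urllib.error
import urllib.request


def candidate_urls(domain: str) -> list[str]:
    """Return the URL variants worth trying before calling a domain dead."""
    host = domain.strip().lower().rstrip(".")
    # Try both the listed host and its www. form (unless it already has one).
    hosts = [host] if host.startswith("www.") else [host, "www." + host]
    # HTTPS first, since some live sites no longer listen on plain HTTP.
    return [f"{scheme}://{h}/" for scheme in ("https", "http") for h in hosts]


def responds(url: str, timeout: float = 10.0) -> bool:
    """True if any HTTP response comes back, including 4xx/5xx pages."""
    # Hypothetical browser-like headers; some servers drop unfamiliar clients.
    headers = {"User-Agent": "Mozilla/5.0 (liveness check)", "Accept": "*/*"}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered with an error status: still alive
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False  # DNS failure, refused/dropped connection, timeout, bad URL


def looks_alive(domain: str, probe=responds) -> bool:
    """A domain counts as dead only if every variant fails to respond."""
    return any(probe(url) for url in candidate_urls(domain))
```

Passing the probe in as a parameter keeps the variant logic testable without network access; a failed probe here still says nothing about registration, DNS, or mail, which the checks listed above would cover.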

smugma, almost 3 years ago

I downloaded the file and looked at the second 000 in his file, which refers to wixsite.com.

It appears that wixsite.com isn't valid but www.wixsite.com is, and redirects to wix.com.

It's misleading to say that the sites are dead. As noted elsewhere, his source data is crap (other sites I checked such as wixstatic.com don't appear to be valid), and his methodology is bad too, or at least his describing the sites as dead is misleading.

bioemerl, almost 3 years ago

I'm honestly amazed that out of the top million sites, which probably includes a ton of tiny tiny sites that are idle or abandoned, only ten percent are offline.

tete, almost 3 years ago

The biggest problem I find is that it seems to be pretty "outdated" to keep redirects in place if you move stuff. So many links to news websites, etc. will cause a redirect to either / or a 404 (which is a very odd thing to redirect to in my opinion).

If you are unlucky, an article you wanted to find has also completely disappeared. This is scary, because it's basically history disappearing.

I also wonder what will happen to text on websites that rely on AJAX when their JavaScript breaks because a third party goes down. While the Internet Archive seems to be building tools for people to use to mitigate this, I found that they barely worked on websites that do something like this.

Another worry is the ever-increasing size of these scripts making archiving more expensive.

gravitate, almost 3 years ago

> Domain normalization is a bitch

I'm a no-www advocate. All my sites can be accessed from the apex domain. But some people for whatever reason like to prepend www to my domains, so I wrote a rule in Apache's .htaccess to rewrite the www to the apex.

Here's a tutorial for doing that: https://techstream.org/Web-Development/HTACCESS/WWW-to-Non-WWW-and-Non-WWW-to-WWW-redirect-with-HTACCESS

baby, almost 3 years ago

Free.fr, one of the biggest ISPs in France a while back, and perhaps still today, still runs all the old-school websites it was hosting for people (for free). It's quite insane, but a lot of the French web 1.0 is still alive today thanks to them. Truly an ISP run by passionate technical people.

altdataseller, almost 3 years ago

All these top million lists are very good at telling you the top 10K-50K sites on the web. After that, you're going into 'crapshoot' land, where the 500,000th most popular site is very likely to be a site that got some traffic a long time ago, but now isn't even up.

So I would take this data with a grain of salt. You're better off just analyzing the top 100K sites on these lists.

the_biot, almost 3 years ago

By what possible criteria are these the "top" million sites, if 10% are dead? I'd start with questioning that data.

MonkeyMalarky, almost 3 years ago

Last time I tried to crawl that many domains, I ran into problems with my ISP's DNS server. I ended up using a pool of public DNS servers to spread out all the requests. I'm surprised that wasn't an issue for the author?
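
Spreading lookups over a pool of public resolvers can be as simple as round-robin assignment. A minimal sketch, assuming a pool of three well-known public resolvers (Cloudflare, Google, Quad9); the function name is hypothetical, and the actual lookups would need a DNS library that lets you choose the server, which is not shown here.

```python
from itertools import cycle

# Public resolver addresses (Cloudflare, Google, Quad9); swap in your own pool.
RESOLVERS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]


def assign_resolvers(domains, resolvers=RESOLVERS):
    """Pair each domain with a resolver in round-robin order."""
    # cycle() repeats the pool endlessly; zip() stops when domains run out.
    return list(zip(domains, cycle(resolvers)))


for domain, resolver in assign_resolvers(["a.example", "b.example", "c.example", "d.example"]):
    print(f"{domain} -> {resolver}")
```

With three resolvers, each one sees roughly a third of the query volume, which may be enough to stay under per-client rate limits.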

ocdtrekkie, almost 3 years ago

I've been working lately on migrating sites I ran in 2008 or so to my new preferred hosting strategy. I know zero people look at them, since many are functionally broken at present, but I don't like the idea of actually removing them from the web. So I'm patching them up, migrating them to a more maintainable setting, and keeping them going. Maybe someday some historian will get something out of it.

macintux, almost 3 years ago

Title is misleading: that’s the outcome, but the bulk of the story is the data processing to reach that conclusion.

phkahler, almost 3 years ago

Read that again folks:

"a very reasonable but basic check would be to check each domain and verify that it was online and responsive to http requests. With only a million domains, this could be run from my own computer relatively simply and it would give us a very quick temperature check on whether the list truly was representative of the “top sites on the internet”."

This took him 50 minutes to run. Think about that when you want to host something smaller than a large commercial site. We live in the future now, where bandwidth is relatively high and computers are fast. Point being that you don't need to rent or provision "big infrastructure" unless you're actually quite big.
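
To put a number on that, a quick back-of-the-envelope calculation using only the figures quoted above (a million domains, 50 minutes), with a hypothetical helper name:

```python
def requests_per_second(total_requests: int, minutes: float) -> float:
    """Average request rate needed to finish in the given time."""
    return total_requests / (minutes * 60)


rate = requests_per_second(1_000_000, 50)
print(f"{rate:.0f} requests per second on average")  # about 333
```

That is an average sustained rate from one home machine; it says nothing about peak concurrency or how many requests were in flight at once.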

mouzogu, almost 3 years ago

whenever i go through my bookmarks, i tend to find maybe 5-10% are now 404.

this is why i like the archive.ph project so much and am using it more as a kind of bookmarking service.

yajjackson, almost 3 years ago

Tangential, but I love the format for your site. Any plans to do a "How I built this blog" post?

terrycody, almost 3 years ago

Nice work.

Just one thing: analyzing sites by total referring domains is not accurate, as your result showed. A backlink can easily be faked, and you can literally spam 1 million links within 1 day for any domain. Thus, this data source is not very useful.

For a more accurate result, try the Ahrefs top 1 million domains, ranked by their traffic. Ahrefs ranks sites by their ranking keywords and infers the traffic numbers from those, meaning these websites are live and ranking for some keywords.

You will see the result is much more accurate then. Maybe not even a single website will be offline, because they are earning good cash.

allknowingfrog, almost 3 years ago

I don't have any particular opinions on the author's conclusions, but I learned a thing or two about the power of terminal commands by reading through the article. I had no idea that xargs had a parallel mode.
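
The parallel mode in question is the `-P` (max-procs) flag of `xargs`. For comparison, the same fan-out idea in Python might look like the sketch below; it is an analogue, not a translation of the article's command, and the function name and worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(check, items, workers=50):
    """Apply check() to every item using a pool of worker threads.

    Results come back in input order, like a parallel map().
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check, items))


# Stand-in for a network check: just measure each hostname's length.
print(run_parallel(len, ["example.com", "example.org"], workers=2))
```

Threads suit this kind of work because each check spends most of its time waiting on the network rather than using the CPU.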

flas9sd, almost 3 years ago

having the luxury of scrutinizing the method and retesting: to "normalize" domains and skip the www skewed results - not all websites do their redirects across apex to www (and schemas). Some servers weren't answering the request with the default curl accept header `*/*` and needed encouragement.

I retested the 000 class of the .de ccTLD (1227) and found more than a third (473) of them answering when prefixed with www. Lots of German universities were false negatives - if this is representative I cannot tell, just a hint to retest.

banana_giraffe, almost 3 years ago

The takeaway from this is slightly off. There aren't 107776 sites that are dead, there are 107776 sites that don't run an HTTP server, or are otherwise dead.

If you try to connect via HTTP or HTTPS, then a quick run yields 91106 sites that are dead, or 9.11%.

(And I ran this test on an AWS EC2 node with a fairly aggressive timeout. No doubt some % of sites play dead to AWS, or didn't respond fast enough for me.)
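
The two counts above, expressed as shares of the million-entry list (a trivial check; the helper name is hypothetical):

```python
def percent_of_list(count: int, total: int = 1_000_000) -> float:
    """Share of the list, as a percentage."""
    return 100 * count / total


print(f"HTTP only:     {percent_of_list(107_776):.2f}%")  # 10.78%
print(f"HTTP or HTTPS: {percent_of_list(91_106):.2f}%")   # 9.11%
```

Allowing HTTPS as well as HTTP moves about 1.7 percentage points of the list from "dead" to "alive".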

zX41ZdbW, almost 3 years ago

This looks surprisingly similar to the unfinished research that I did: https://github.com/ClickHouse/ClickHouse/issues/18842

kozziollek, almost 3 years ago

Most cities in Poland have their own $city.pl domain and allow websites to buy $website.$city.pl. That might not be well known. And cities have their websites, so I guess it's OK.

But info.pl and biz.pl? Did nobody hear about country variants of gTLDs?!

ghostly_s, almost 3 years ago

Wow, I would not have suspected `tee` is able to handle multiple processes writing to the same file. Doesn't seem to be mentioned on the man-page, either.

pahool, almost 3 years ago

zombo.com still kicking!

nr2x, almost 3 years ago

Majestic is a shit list. Mystery solved.

indigodaddy, almost 3 years ago

Are there more cycles/CPU/work involved in `cat verylargefile | awk` vs `awk verylargefile`?

gumby, almost 3 years ago

His 'www' logic is flawed: https://www.example.com and https://example.com need not return the same results, but his checking code sends the output straight to /dev/null so he has no way of knowing.
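
Keeping the response bodies instead of discarding them would allow at least a crude comparison. A minimal sketch, assuming the two bodies have already been fetched; the function names are hypothetical, and an exact-hash match is a deliberately strict test (pages with timestamps or per-request tokens would differ byte-for-byte even when they are the same site).

```python
import hashlib


def body_digest(body: bytes) -> str:
    """SHA-256 fingerprint of a response body."""
    return hashlib.sha256(body).hexdigest()


def same_content(apex_body: bytes, www_body: bytes) -> bool:
    """True only when both hosts served byte-identical content."""
    return body_digest(apex_body) == body_digest(www_body)


print(same_content(b"<h1>Welcome</h1>", b"<h1>Welcome</h1>"))  # True
print(same_content(b"<h1>Welcome</h1>", b"<h1>Parked</h1>"))   # False
```

Storing digests rather than full bodies keeps the output small enough to log for a million domains.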

winddude, almost 3 years ago

No they're not.

noiv, almost 3 years ago

How does a dead site make it into the top million?

kderbyma, almost 3 years ago

wouldn't this imply that either the ranking system is broken.....or there are less than 1 million active sites.....

softwaredoug, almost 3 years ago

My current beliefs about how people use and trust information on the Web.

First, trust is _everything_ on the Web; it is the thing people first think of when arriving at some information. But how people evaluate trust has changed dramatically over the last 10 years.

- Trust now comes almost exclusively from social proof. Searching reddit, youtube, and other extremely _moderated_ sources of information, where the most work is done to ensure content comes from actual human beings. How many of us now google `<topic> reddit` instead of just `<topic>`?
- Of course a lot of this trust is misplaced. There's a very thin line between influencers and cult leaders / snake oil salesmen. Our last President used this hack really effectively.
- Few trust Google's definition of trust anymore -- essentially page rank. This made more sense when the Web essentially was social, where inbound links were very organic. Now with the trust in general Web sites evaporated, the main 'inbound links' anyone cares about come from individuals or communities they trust or identify with. They don't trust Google's algorithm (it's too opaque, and too easily gamed).

This of course means the fracturing of truth away from elites. Sometimes this could be a good thing, but in many cases (cough Covid cough) it might be pretty disastrous for misinformation.

zzzeek, almost 3 years ago

irony that the site is not responding?

zinekeller, almost 3 years ago

TLDR: Campbell's methodology is flawed and does not consider edge cases (one of which, equating apex-only and www-prefixed domains, I consider reckless), and he didn't understand how Majestic collects and processes its data.

Longer version: This isn't comprehensive, but I can think of two main reasons why:

- The Majestic Million lists only the registrable part (with some exceptions), and this sometimes leads to central CDNs being listed. For example, the Majestic Million lists wixsite.com (which, for those who are unaware, is a CDN domain used by Wix.com with separate subdomains), but if you visit wixsite.com you won't get anything. Same with Azure: subdomains of azureedge.net and azurewebsites.net do exist (for example https://peering.azurewebsites.net/) but azureedge.net and azurewebsites.net themselves don't exist. Without similar filtering, using the Cisco list (https://s3-us-west-1.amazonaws.com/umbrella-static/index.html) would quickly lead you to this precise problem (mainly because the number one is "com", but phew, at least http://ai./ does exist!)
- Also, shame on the author for considering www-prefixed and apex-only as one and the same. For some websites, they aren't. Take this example: jma.go.jp (Japan Meteorological Agency), which doesn't respond (actually NODATA) on http://jma.go.jp/ but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn (Chinese ICP Licence Administrator) won't respond at all but www.beian.gov.cn will. And for ncbi.nlm.nih.gov (National Center for Biotechnology Information)? I can't blame Majestic: https://www.ncbi.nlm.nih.gov/ and https://ncbi.nlm.nih.gov/ don't redirect to a canonical domain, and unless you've compared the HTTP pages there's no way you would know that they are the same website!

Edit: I've downloaded the CSV to check my claims, and it shows:

    wixsite.com 0
    beian.gov.cn 0

Please, for the love of sanity, consider what the criteria for inclusion in the Majestic Million (and similar lists) are. I can't believe I'm saying this, but can we crowd-source "Falsehoods programmers believe about domains"?

Also, an addendum about crawling, which I consider "probably forgivable":

- Some websites are only available in certain countries (internal Russian websites don't respond at all outside Russia, for example). This can skew the numbers a little bit.

spaceman_2020, almost 3 years ago

Not surprising. We're far away from the glory days of the vibrant, chaotic web.

In countries like India that onboarded most users through smartphones instead of computers, websites are not even necessary. There's a huge dearth of local-focused web content as well since there just isn't enough demand.

superb-owl, almost 3 years ago

One of the few things I like about blockchain is the promise of a less ephemeral web.
