Summary: On August 30, 2020, at 10:04 GMT, CenturyLink identified an issue affecting users across multiple markets. The IP Network Operations Center (NOC) was engaged, and initial research identified that an offending flowspec announcement prevented Border Gateway Protocol (BGP) sessions from establishing across multiple elements throughout the CenturyLink network. The IP NOC deployed a global configuration change to block the offending flowspec announcement, which allowed BGP to begin to establish correctly. As the change propagated through the network, the IP NOC observed all associated service-affecting alarms clearing and services returning to a stable state.

Source: https://puck.nether.net/pipermail/outages/2020-August/013229.html
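For context on the mechanism described above: a flowspec announcement is a BGP NLRI carrying packet-match criteria (destination prefix, protocol, ports, etc.) plus an action such as discard or rate-limit, which receiving routers install rather like a firewall rule. The sketch below is purely illustrative, not CenturyLink's actual mitigation or any vendor's policy syntax; it only models the idea of validating a flowspec rule and rejecting one that is dangerously broad. All field names and the example rule are invented.

    # Illustrative model only: real networks express this as vendor routing
    # policy, not Python. Field names and the sample rule are invented.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FlowspecRule:
        dst_prefix: str          # e.g. "192.0.2.0/24"; "0.0.0.0/0" matches everything
        protocol: Optional[int]  # IP protocol number, None = any
        action: str              # "discard", "rate-limit", "accept", ...

    def prefix_len(prefix: str) -> int:
        return int(prefix.split("/")[1])

    def accept_flowspec(rule: FlowspecRule) -> bool:
        """Reject flowspec rules too broad to be a sane mitigation,
        e.g. a discard action covering nearly all destination space."""
        if rule.action == "discard" and prefix_len(rule.dst_prefix) < 8:
            return False
        return True

    if __name__ == "__main__":
        # Hypothetical "offending" announcement, NOT the real one from the outage.
        suspicious = FlowspecRule(dst_prefix="0.0.0.0/0", protocol=None, action="discard")
        print(accept_flowspec(suspicious))  # False -> this rule would be blocked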
Massive reconvergence event in their network, causing edge router BGP sessions to bounce (due to CPU). Right now all their big peers are shutting down sessions with them to give Level3's network the ability to reconverge. Prefixes announced to 3356 are frozen on their route reflectors and not getting withdrawn.

Edit: if you are a Level3 customer, shut your sessions down to them.
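If you announce prefixes to 3356 and want to sanity-check the "frozen prefixes" claim for your own space, one rough approach (only a sketch, assuming RIPEstat's public looking-glass endpoint; the prefix below is a placeholder, substitute your own) is to ask the route collectors what they currently see and look for AS3356 in the paths:

    # Rough check: query RIPEstat's looking-glass for a prefix and count how
    # often AS 3356 shows up in the returned paths. Uses a crude substring
    # match so it does not depend on the exact response schema.
    import urllib.request

    def as3356_mentions(prefix: str) -> int:
        url = f"https://stat.ripe.net/data/looking-glass/data.json?resource={prefix}"
        with urllib.request.urlopen(url, timeout=30) as resp:
            body = resp.read().decode("utf-8")
        return body.count(" 3356 ")

    if __name__ == "__main__":
        target = "193.0.0.0/21"  # placeholder prefix; use one of your own
        print(f"approximate AS3356 path mentions for {target}: {as3356_mentions(target)}")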
CenturyLink/Level3 on Twitter:
"We are able to confirm that all services impacted by today’s IP outage have been restored. We understand how important these services are to our customers, and we sincerely apologize for the impact this outage caused."<p><a href="https://twitter.com/CenturyLink/status/1300089110858797063" rel="nofollow">https://twitter.com/CenturyLink/status/1300089110858797063</a>
India just lost to Russia in the final of the first-ever Online Chess Olympiad, probably due to connection issues for two of its players. I wonder if it's related to this incident and if the organizers are aware.
Edit: the organizers are aware, and Russia and India have now been declared joint winners.
I was doing development work that uses a server I've got hosted on DigitalOcean. I started getting intermittent responses, which I thought was weird as I hadn't changed anything on the server. I spent a good ten minutes trying to debug the issue before searching for something on DuckDuckGo, which also didn't respond. Cloudflare shouldn't be involved at all with my little site, so I don't think it's limited to just them.
M5 Hosting here, where this site is hosted. We just shut down 2 sessions with Level3/CenturyLink because the sessions were flapping and we were not getting a complete full route table from either session. There are definitely other issues going on on the Internet right now.
Analysis of what we saw at Cloudflare, how our systems automatically mitigated the worst of the impact to our customers, and some speculation on what may have gone wrong: https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/
I had this earlier! A bunch of sites were down for me; I couldn't even connect to this site.

The problem is I don't know where to find out what was going on (I tried looking up live DDoS-tracking websites, "is it down or is it just me" websites, etc.). I couldn't find a single place talking about this.

Is there a source where you can get instant information on Level3 / global DNS / major outages?
Does anyone have any good resources for learning more about the "internet-level" infrastructure affected today and how global networks are connected?
Odd, I'm trying to reach a host in Germany (AS34432) from Sweden but get rerouted Stockholm-Hamburg-Amsterdam-London-Paris-London-Atlanta-São Paulo, after which the packets disappear down a black hole. All routing problems occur within Cogent.

     3  sth-cr2.link.netatonce.net (85.195.62.158)
     4  te0-2-1-8.rcr51.b038034-0.sto03.atlas.cogentco.com
     5  be3530.ccr21.sto03.atlas.cogentco.com (130.117.2.93)
     6  be2282.ccr42.ham01.atlas.cogentco.com (154.54.72.105)
     7  be2815.ccr41.ams03.atlas.cogentco.com (154.54.38.205)
     8  be12194.ccr41.lon13.atlas.cogentco.com (154.54.56.93)
     9  be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)
    10  be2315.ccr31.bio02.atlas.cogentco.com (154.54.61.113)
    11  be2113.ccr42.atl01.atlas.cogentco.com (154.54.24.222)
    12  be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158)
    13  be2027.ccr22.mia03.atlas.cogentco.com (154.54.86.206)
    14  be2025.ccr22.mia03.atlas.cogentco.com (154.54.47.230)
    15  * level3.mia03.atlas.cogentco.com (154.54.10.58)
    16  * * *
    17  * * *
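If you want to keep evidence of reroutes like the one above, here is a minimal sketch (not the commenter's tooling; it assumes a Unix-like host with the standard traceroute binary installed, and the target IP is just a placeholder) that timestamps and prints the current path:

    # Snapshot the current forwarding path so route changes can be compared later.
    # Assumes the standard `traceroute` binary; -n = numeric, -m = max hops.
    import datetime
    import subprocess

    def snapshot_path(target: str, max_hops: int = 20) -> str:
        result = subprocess.run(
            ["traceroute", "-n", "-m", str(max_hops), target],
            capture_output=True, text=True, timeout=120,
        )
        return result.stdout

    if __name__ == "__main__":
        stamp = datetime.datetime.utcnow().isoformat()
        target = "193.0.14.129"  # placeholder target (k.root-servers.net)
        print(f"# traceroute to {target} at {stamp}Z")
        print(snapshot_path(target))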
I'm having issues reaching IP addresses unrelated to Cloudflare. Based on some traceroutes, it seems AS174 (Cogent) and AS3356 (Level 3) are experiencing major outages.
This explains a lot. Initially I thought my mobile phone's Internet connectivity was flaky because I couldn't access HN here in Australia, whilst it was fine over Wi-Fi (wired Internet).
Misread the headline as "Level 3 Global Outrage" and thought "someone had defined outrage levels?" and "it doesn't matter, he'll just attribute it to the Deep State".

In some ways I'm a little bit disappointed it's only a glitch in the internet.
Had to laugh: "I'm seeing complaints from all over the planet on Twitter"

The one site I can't see is Twitter. (Not a heart-wrenching loss, mind you...)
No peering problems from my network with Level3 in London Telehouse West; maybe a minute or so of increased latency at 10:09 GMT.

Routing to a Level3 ISP where I have an office in the States peers with London15.Level3.net.

No problem to my Cogent ISP in the States, although we don't peer directly with Cogent; that bounces via Telia.

Going east from London, a 10-second outage at 12:28:42 GMT on a route that runs from me, to Level3, to Tata in India, but no rerouting.
So, that's why HN is unreachable from Belgium at the moment (right when I was trying to figure out a DNS cache problem in Firefox, of course).

An SSH tunnel through OVH/Gravelines is working so far. Edit: Proximus. Edit 2: also, Orange Mobile.
This had me really confused until I saw it was a global outage. I have been getting delayed iOS push notifications (from Prowl) for the last few hours, from a device I was fairly sure I had disconnected 3 hours ago (a pump). Got me questioning whether I really disconnected it before I left.

I'm wondering if we're at the point where internet outages should have some kind of (emergency) notification/SMS sent to _everyone_.
NANOG are talking about a CenturyLink outage and BGP flapping (AS 3356) as of 03:00 US/Pacific, AS209 possibly also affected.

AS3356 is Level 3, AS209 is CenturyLink.

https://mailman.nanog.org/pipermail/nanog/2020-August/209359.html
> <i>"Root Cause: An offending flowspec announcement prevented BGP from establishing correctly, impacting client services."</i><p>--<p>That doesn't really explain the "stuck" routes in their RRs... maybe it'll make sense once we've gotten some more details...
Everything to Oracle Cloud's Ashburn US-East location is down.<p>Their console isn't responding at all and all my servers are unreachable. Their status console reports all normal though.
Seems like "the internet" works again here in Norway. I've been limited to local sites all day.<p>Hacker news has been off for several hours for me.<p>Whatever it was it must have been nasty.
There is a major internet outage going on. I am using Scaleway; they are also affected. According to Twitter, Vodafone, CityLink and many more are also affected.
A service I run on DigitalOcean was affected by this early this morning. Looks like it was mitigated by DO, so I'm very grateful for that. That said, the service I run is time-sensitive, so failures like this are pretty unfortunate for me. Where would I get started with building in redundancy against this sort of outage?
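Not a complete answer to the question above, but a common starting point is to run the same service with two independent providers and fail over based on health checks, usually at the DNS or load-balancer layer. The sketch below is only a client-side illustration of that idea; the hostnames and the /healthz path are placeholders, not anything specific to DigitalOcean:

    # Ordered failover between two independently hosted copies of a service.
    # Hostnames and the /healthz path are placeholders for this sketch; real
    # deployments usually put this logic in DNS failover, anycast, or an LB.
    import urllib.error
    import urllib.request

    ENDPOINTS = [
        "https://primary.example.com/healthz",
        "https://backup.example.net/healthz",
    ]

    def first_healthy(endpoints=ENDPOINTS, timeout=5):
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except (urllib.error.URLError, OSError):
                continue  # this provider/path looks down; try the next one
        return None

    if __name__ == "__main__":
        print(first_healthy() or "no endpoint reachable")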
Fastly is also seeing problems. [0]

However, they report that they've identified the issue and are fixing it.

[0]: https://status.fastly.com/
Internet infrastructure is broken.

Why do a few companies control the backbone of the internet? Shouldn’t there be a fallback or disaster recovery plan if one or more of these companies become unavailable?
Even https://downdetector.com/ has problems loading for me. Central Europe.
*internetweathermap is down
Chess.com was down due to the outage and some of the Indian players got disconnected and lost on time, so FIDE declared India and Russia joint winners of the Online Chess Olympiad 2020.
Shameless plug:

I lost too much precious time when GitHub/npm/Cloudflare were going down before figuring out it was them.

So I'm currently working on a project [1] to monitor the entire third-party stack you use for your services. Hit me up if you want access; I'll give free access for a year+ to some folks to get feedback.

[1] https://monitory.io
Cloudflare status page:

Update - Major transit providers are taking action to work around the network that is experiencing issues and affecting global traffic.

We are applying corrective action in our data centers as the situation changes in order to improve reachability.

Aug 30, 14:26 UTC

https://www.cloudflarestatus.com
I just experienced HN down for several minutes before it loaded and I saw this story at the top.

I'm doing something with the HN API as I type this, so for a moment I was trying to decide if I'd been IP blocked, even though the API is hosted by Firebase.

I haven't noticed any obvious issues elsewhere yet.

(Just got a delay while trying to submit this comment.)
Could this be a Russia move vis-à-vis today's expected Belarus protests?

(I hope this doesn't mean a violent crackdown is imminent.)

Oy: https://mobile.twitter.com/HannaLiubakova/status/1300064535697555456
Can anyone help me understand why I can't access HN from my iPhone, but I can from my computer? Both are on the same network. I'm getting "Safari cannot open the page because the server cannot be found", and many apps won't work at all either.
It wasn't a total outage for the site I was trying to reach. It took about 20 minutes to make an order, but after multiple retries (errors were reported as a 522 with the problem being somewhere between Manchester, UK and the host), it did go through.
I have two pipes from two different (consumer) ISPs at home. One can reach HN, the other can't.

Incidentally, uBlock Origin seems to be completely broken. It doesn't have any local blacklists to work from when their servers(?) are unavailable?
From the other (Cloudflare) thread (post: https://news.ycombinator.com/item?id=24322603), the outages list (https://puck.nether.net/mailman/listinfo/outages):

https://puck.nether.net/pipermail/outages/2020-August/thread.html

Not a network engineer, but based on the comments there it looks like it's a BGP blackhole incident.

Edit: removed details about the similarity to a 1997 incident based on input from commenters.
This knocked out the Starbucks app and some of their systems this morning. A bunch of people in line couldn't log in and they were saying parts of their whole internal system were down, too.
I'm confused about why Cloudflare had problems but other CDN providers/sites with private CDNs like Google did not. Is there something different about how Cloudflare operates?
I experienced this issue while reading docs at "Read the Docs" (and ironically had connection issues while trying to read this very exact page right here, too.)
I was doing a big release over the evening. I was working fine up until about 6 hours ago, when I signed off. Our network monitors show an outage started about half an hour later (at about 4:05am CST). Service restored a few minutes ago, at about 9:44am CST. I don't know if our problem is the same as this problem, but we are on CenturyLink.
Also related: https://www.cloudflarestatus.com/incidents/hptvkprkvp23
How the xxxx did it take CenturyLink/Level3 like 3-4 hours to fix this problem?

Again (https://news.ycombinator.com/item?id=24322988), not a network engineer, but it seemed like their routers actively stopped other networks from working around the problem, since L3 would still keep pushing other networks' old routes even after those networks tried to stop that.

Also: BGP probably needs to be redesigned from the ground up by software engineers with experience designing systems that can remain working with hostile actors.
Based on what I've seen: they essentially "shut down the Internet" for probably a quarter of the global population for about 3-4 hours.

That response time is atrocious. It wasn't that they needed to fix broken hardware; rather, they needed to stop running hardware from actively sabotaging global routing via the inherently insecure BGP protocol. That took 3-4 hours to happen.

As an example: being in Sweden with an ISP that uses Telia Carrier for connectivity, things started working around the time of https://twitter.com/TeliaCarrier/status/1300074378378518528