It's a bad outage, it seems. 16+ hours now and counting, and they don't seem to have a root cause or a hot standby.

We just cut over all of our stuff (Ambassador API Gateway) to Docker Hub. Lots of the Kubernetes ecosystem is on Quay, so I wonder how this is affecting others. Our users are definitely affected, as well as our development team.
This literally cost me sleep last night: it paged me for new Kubernetes nodes that failed to transition to 'Ready' because they couldn't pull Calico images during bootstrapping. After some duct-taping to get those initial nodes up and running, we just moved the Quay-hosted images to a GCR repository and moved on with life.

But that doesn't diminish the fact that this outage is a complete disaster for Quay.
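For anyone in the same spot, the move is just a pull/tag/push cycle. Rough sketch below; the image name, tag, and GCR project are hypothetical examples, and the pull only works while Quay is reachable or the layers are already cached on the machine doing the mirroring:

    # Mirror a Quay-hosted image into our own GCR project (names and tag are examples)
    docker pull quay.io/calico/node:v3.8.2                                      # needs Quay up, or the image cached locally
    docker tag  quay.io/calico/node:v3.8.2 gcr.io/my-project/calico/node:v3.8.2
    docker push gcr.io/my-project/calico/node:v3.8.2                             # requires prior 'gcloud auth configure-docker'
    # ...then point the Calico manifests at gcr.io/my-project/... instead of quay.io/...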
For those on k8s: quick reminder, unless you're in development mode your image pull policy should be IfNotPresent, so you have at least some measure of caching to protect you from further degradation of service.

Beyond this, I'm updating things to use GCR so that if this bleeds into tomorrow, my team's development timeline isn't impacted any further.
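As a concrete sketch (the Deployment and container index are hypothetical), you can flip the policy on an existing workload with a JSON patch. Keep in mind that Kubernetes defaults the policy to Always when the tag is :latest or omitted, so pin a real tag as well:

    # Hypothetical Deployment; switch the pull policy so already-cached images keep working
    kubectl patch deployment my-app --type=json -p='[
      {"op": "add", "path": "/spec/template/spec/containers/0/imagePullPolicy", "value": "IfNotPresent"}
    ]'
    # Caveat: this only helps on nodes that already have the image; freshly bootstrapped nodes still need a reachable registry.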
Yup, we also had alerts for our k8s nodes not reaching Ready (Calico images). We moved to Docker Hub, but hit timeouts there a few times; I guess they got a sudden spike in traffic.

This will be bad for Red Hat's reputation.