
Kubernetes Failure Stories

513 points by hjacobs over 6 years ago

15 comments

m0zg over 6 years ago
It's not for everyone and it has significant maintenance overhead if you want to keep it up to date _and_ can't re-create the cluster with a new version every time. This is something most people at Google are completely insulated from in the case of Borg, because SREs make the infrastructure "just work". I wish there was something drastically simpler. I don't need three dozen persistent volume providers, or the ability to e.g. replace my network plugin or DNS provider, or load balancer. I want a sane set of defaults built in. I want easy access to persistent data (currently a bit of a nightmare to set up in your own cluster). I want a configuration setup that can take command-line params without futzing with templating and the like. As horrible and inconsistent as Borg's BCL is, it is, IMO, an improvement over what K8S uses.

Most importantly: I want a lot fewer moving parts than it currently has. Being "extensible" is a noble goal, but at some point cognitive overhead begins to dominate. Learn to say "no" to good ideas.

Unfortunately there's a lot of K8S config and specific software already written, so people are unlikely to switch to something more manageable. Fortunately, if complexity continues to proliferate, it may collapse under its own weight, leaving no option but to move somewhere else.
dvnguyen over 6 years ago
Having used Docker Compose/Swarm for the last two years, I remember having problems with them twice. One of them was an MTU setting which I didn't really understand, but overall I was relatively happy with them. Since Kubernetes seems to have won, I decided to learn it but got some disappointments.

The first disappointment is setting up a local development environment. I failed to get minikube running on a MacBook Air 2013 and a Ubuntu Thinkpad. Both have VT-x enabled and run Docker and VirtualBox flawlessly. Their online interactive tutorial was good though, enough for learning purposes.

Production setup is a bigger disappointment. The only easy and reliable ways to have a production-grade Kubernetes cluster are to lock yourself into a big-player cloud provider or an enterprise OS (Red Hat/Ubuntu), or to introduce a new layer on top of Kubernetes [1]. Locking myself into enterprise Ubuntu/Red Hat is expensive, and I'm not comfortable with adding a new, moving, unreliable layer on top of Kubernetes, which is itself built on top of Docker. One thing I like about the Docker movement is that it commoditizes infrastructure and reduces lock-in. I can design my infrastructure so it uses an open-source-based cloud product first and can easily move to other providers or self-hosting if needed. With Kubernetes, things are going the other way. Even if I never moved out of the big 3 (AWS/Azure/GCloud), the migration process could be painful since their Kubernetes offerings may introduce further lock-in for logging, monitoring, and so on.

[1]: https://kubernetes.io/docs/setup/pick-right-solution/
cygned over 6 years ago
I am a developer and I find k8s frustrating. To me, its documentation is confusing and scattered among too many places (best example: overlay networks). I have read multiple books and gazillions of articles and yet I have the feeling that I am missing the bigger picture.

I was able to set it up successfully a couple of times, with more or less time required. Last time, I gave up after four days because I realized that what I needed was an "I just want to run a simple cluster" solution, and while k8s might provide that, its flexibility makes it hard for me to use.
manigandham over 6 years ago
I don't understand all the negative comments here. K8S solves many problems regardless of scale. You get a single platform that can run namespaced applications using simple declarative files, with consolidated logging, monitoring, load-balancing, and failover built in. What company would not want this?
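For illustration, a minimal sketch of the kind of declarative file meant here, applied with kubectl; the namespace, names, and image are made up for the example and are not from this thread:

    # Hypothetical example only: a tiny namespaced Deployment applied declaratively.
    kubectl create namespace demo
    kubectl apply -n demo -f - <<'EOF'
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hello-web
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: hello-web
      template:
        metadata:
          labels:
            app: hello-web
        spec:
          containers:
          - name: web
            image: nginx:1.15
            ports:
            - containerPort: 80
    EOF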
nisa over 6 years ago
The k8s hype feels like the Hadoop hype from a few years ago. Both solve problems that most don't have, and there is a lot of complexity: some due to the nature of the problem, some because everything is new and moving.

Of course it's 2019 and you have to migrate Hadoop to run on k8s now :)

My impression is that if you are a small shop and have the money, use k8s on Google and be happy, but don't attempt to set it up yourself.

If you only have a few dedicated boxes somewhere, just use Docker Swarm and something like Portainer.
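For a box or two, that path can be as small as the following sketch; the compose file name, stack name, and the Portainer image/port are assumptions based on what was common around this time, not details from the comment:

    # Minimal single-box Swarm setup, sketched from the suggestion above.
    docker swarm init                                  # make this machine a one-node swarm
    docker stack deploy -c docker-compose.yml myapp    # deploy an existing compose file as a stack

    # Portainer as a simple management UI for the swarm.
    docker run -d -p 9000:9000 \
      -v /var/run/docker.sock:/var/run/docker.sock \
      portainer/portainer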
awinter-py over 6 years ago
Beyond strictly runtime failures, 2018 feels like the year that most of my friends tried kube but not everybody stayed on.

The adoption failures are mostly networking issues specific to their cloud. Performance and box limits vary widely depending on the cloud vendor, and I still don't quite understand the performance penalty of the different overlay networks / adapters.
stonewhite over 6 years ago
I managed multiple Mesos+Marathon clusters in production for a little over 1.5 years, and when I switched over to K8s the only thing that felt like an improvement was the kubectl CLI.

I really liked, and miss, the beauty of simplicity in Marathon, where everything was a task: the load balancer, the autoscaler, the app servers, everything. I think it failed because provisioning was not easy, it lacked first-class integrations with cloud vendors, and it had horrible, horrible documentation.

Kind of sad to see that it lost the hype battle; since then even Mesosphere has had to come up with a K8s offering.
bdcravens over 6 years ago
I've started the planning phase of a Kubernetes course, geared toward developers more so than enterprise gatekeepers. As I read stories like these, I jump between different thoughts and feelings:

1) No matter what I think I know, there are too many dark corners to create an adequate course.

2) K8S is such a dumpster fire that I shouldn't encourage others.

3) There's a hell of an opportunity here.

Thoughts? Worth pursuing? Anything in particular that should be included that usually isn't in this kind of training?
stunt over 6 years ago
Kubernetes solves a problem that most companies don't have. That is why I don't understand why the hype around it is so big.

For the majority, it adds little value compared to the added infrastructure complexity, the cost of the learning curve, and the ongoing operation and maintenance.
tnolet over 6 years ago
I'd be interested in a related "microservices failure stories". Must be a big overlap with this.
hjacobs over 6 years ago
Christian already followed the example and created a similar list for serverless: https://github.com/cristim/serverless-failure-stories
dcomp over 6 years ago
I run a single-node cluster at home. In order to handle updates, I just wipe the cluster with kubeadm reset, then kubeadm init, followed by running a simple bash script which loops over the yaml files in nested subdirectories and applies them. I only have to make sure I only ever edit the yaml files and never mess with kubectl edit etc.

    for f in */*.yaml ...

with a directory structure of:

    drwxrwsrwx+ 1 root 1002 176 Jan 20 21:15 .
    drwxrwsrwx+ 1 root 1002 194 Nov 17 20:06 ..
    drwxrwsrwx+ 1 root 1002  68 Jan 20 20:50 0-pod-network
    drwxrwsrwx+ 1 root 1002 104 Nov  1 11:18 1-cert-manager
    drwxrwsrwx+ 1 root 1002  34 Jul 11  2018 2-ingress
    -rwxrwxrwx+ 1 root 1002  93 Jan 20 21:15 apply-config.sh
    drwxrwsrwx+ 1 root 1002  22 Jul 14  2018 cockpit
    drwxrwsrwx+ 1 root 1002  36 Jul  3  2018 samba
    drwxrwsrwx+ 1 root 1002  76 Jul  6  2018 staticfiles
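A rough reconstruction of what such an apply-config.sh could look like, based only on the description above; it is not the commenter's actual script, and the kubeconfig paths and flags are assumptions:

    #!/usr/bin/env bash
    # Sketch of the wipe-and-reapply workflow described above.
    set -euo pipefail

    # Wipe and re-initialise the single-node cluster.
    sudo kubeadm reset -f
    sudo kubeadm init

    # Point kubectl at the freshly generated admin kubeconfig.
    mkdir -p "$HOME/.kube"
    sudo cp /etc/kubernetes/admin.conf "$HOME/.kube/config"
    sudo chown "$(id -u):$(id -g)" "$HOME/.kube/config"

    # Apply every YAML file, one numbered directory at a time,
    # so 0-pod-network runs before 1-cert-manager and 2-ingress.
    for f in */*.yaml; do
      kubectl apply -f "$f"
    done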
AaronFriel over 6 years ago
I just went through all of the post-mortems for my own company's purposes of evaluating Kubernetes. I've been running Kubernetes clusters for about a year and a half and have run into a few of these, but here's what I found striking:

* About half of the post-mortems involve issues with AWS load balancers (mostly ELB, one with ALB)

* Two of the post-mortems involve running control plane components dependent on consensus on Amazon's `t2` series nodes

This was pretty surprising to me because I've never run Kubernetes on AWS. I've run it on Azure using acs-engine and more recently AKS since its release, and on Google Cloud Platform using GKE; and it's a good reminder not to run critical code on T-series instances because AWS can and will throttle or pause these instances.
peterwwillis over 6 years ago
Dang. I wish I had my SRE wiki up and running already, or I'd add a "public postmortems" section.
hjacobs over 6 years ago
There is now a Kubernetes Podcast episode with me about the topic: https://kubernetespodcast.com/episode/038-kubernetes-failure-stories/