Much as it burns me to admit this, for this use case, Jenkins is king. Under 60 nodes, it's perfect.

At a previous job, we migrated from a nasty cron orchestration system to Jenkins. It did a number of things, including building software, batch-generating thumbnails, and moving data about on around 30 nodes, of which about 25 were fungible.

Jenkins Job Builder meant that everything was defined in YAML, stored in git, and repeatable. A sane user environment meant that we could execute as a user and inherit their environment. It has sensible retry logic, and lots of hooks for all your hooking needs. Pipelines are useful for chaining jobs together.

We _could_ have written them as normal jobs to be run somewhere in the 36k-node farm, but that was more hassle than it's worth. Sure, it's fun, but we'd have had to contend with sharing a box that's doing a fluid sim or similar, so we'd have had to carve off a section anyway.

However, Kubernetes to _just_ run cron is a massive waste. It smacks of shiny-new-tool syndrome. Seriously, Jenkins is a single-day deployment. Transplanting the cron jobs is again less than a day (assuming your slaves have a decent environment).

So, with the greatest of respect, talking about building a business case is pretty moot when you are effectively wasting what appears to be more than two man-months on what should be a week-long migration. Think gaffer tape, not carbon fibre bonded to aluminium.

If, however, the rest of the platform lives on Kubernetes, then I could see the logic; having all your stuff running on one platform is very appealing, especially if you have invested time in translating comprehensive monitoring into business-relevant alerts.
I always search for mentions of HashiCorp Nomad in the comments section of front-page Kubernetes articles like this. There are often few or no mentions, so I’d like to add a plug for the HashiStack.

For some reason Nomad seems to get noticeably less publicity than some of the other HashiCorp offerings like Consul, Vault, and Terraform. In my opinion Nomad is right up there with them. The documentation is excellent. I haven’t had to fix any upstream issues in about a year of development on two separate Nomad clusters. Upgrading versions live is straightforward, and I rarely find myself in a situation where I can’t accomplish something I envisioned because Nomad is missing a feature. It schedules batch jobs, cron jobs, long-running services, and system services that run on every node. It has a variety of job drivers outside of Docker.

Nomad, Consul, Vault, and the Consul-aware Fabio load balancer run together to form most of what one might need for a cluster-scheduler-based deployment, somewhat reminiscent of the “do one thing well” Unix philosophy of composability.

Certainly it isn’t perfect, but I’d recommend it to anyone who is considering using a cluster scheduler but is apprehensive about the operational complexity of the more widely discussed options such as Kubernetes.
Setting aside the k8s content itself, I love the way this article is written. It's not a typical tutorial or tips-and-tricks post; it takes you time-traveling through the experience of a big company adopting nascent tech. Lots of great things to take away, even outside of the Kubernetes tips.
> “Sometimes when we do an etcd failover, the API server starts timing out requests until we restart it.”

This is likely related to a set of Kubernetes bugs [1][2] (and a gRPC one [3]) that CoreOS is working diligently to get fixed. The first of these, the endpoint reconciler [4], has landed in 1.9.

More work is pending on the etcd client in Kubernetes. The good news is that the client is used everywhere, so one fix and all components will benefit.

[1]: https://github.com/kubernetes/community/pull/939
[2]: https://github.com/kubernetes/kubernetes/issues/22609
[3]: https://github.com/kubernetes/kubernetes/issues/47131
[4]: https://github.com/kubernetes/kubernetes/pull/51698
I'm curious about what people think about HashiCorp's Nomad vs Kubernetes.

I chose Nomad because I'm already using Consul and I wanted to run raw .NET executables. Would it have been worth it to use Docker with .NET Core?

Not trying to change my infrastructure now, but just curious about whether it is worth the time to play with it on the side.
I haven't been at a k8s shop yet, but at my last job we used Marathon (on DC/OS). I know you can run Kubernetes on DC/OS, but the default scheduler it comes with is Marathon.

Is there an advantage to one over the other? It looks like in both cases you need a platform team (at least 2, maybe 3 people; we had a large, complex setup and had like 10) to set up things like K8s, DC/OS or Nomad, because they are complex systems with a lot of different components ...
Components like Flannel vs Weave Net vs some other container network, handling storage volumes, labels and automatic configuration of HAProxy from them (marathon-lb on DC/OS).

All schedulers (k8s, Swarm, Marathon) seem to use a JSON format for job information that's pretty specific, not only to the scheduler, but to the way other tooling is set up at your specific shop.
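For anyone who hasn't worked with Marathon, here's roughly what that scheduler-specific JSON looks like when you submit an app to its REST API. This is a sketch from memory of the v2 API; the Marathon address, app id, and image are made up:

    import requests

    MARATHON = "http://marathon.example.com:8080"  # placeholder address

    # A Marathon "app" definition: plain JSON, but fields like cpus/mem/instances
    # and the container block are Marathon-specific and don't translate directly
    # to a k8s Deployment or a Nomad job.
    app = {
        "id": "/batch/thumbnailer",
        "cmd": "python generate_thumbnails.py",
        "cpus": 0.5,
        "mem": 256,
        "instances": 2,
        "container": {
            "type": "DOCKER",
            "docker": {"image": "registry.example.com/thumbnailer:1.2"},
        },
    }

    resp = requests.post(MARATHON + "/v2/apps", json=app)
    resp.raise_for_status()

The equivalent object in k8s or Nomad is structured completely differently, which is why these definitions (and the tooling that generates them) end up so coupled to whatever your shop runs.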
Why do you need a 99.99% job completion rate? Why not just design for failure and inevitable retries? It almost seems like you grant platform users a false sense of security by making it very reliable but not perfect.
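Agreed. If every job is written to be idempotent, a retry (whether the scheduler's or a human's) is boring rather than scary. A minimal sketch of the pattern; the paths and the work function are made up:

    import os

    OUTPUT = "/data/output/2017-12-01.csv"   # hypothetical output path
    DONE_MARKER = OUTPUT + ".done"           # hypothetical per-run marker

    def do_the_actual_work():
        # Stand-in for the real job; returns the lines to persist.
        return ["some,result\n"]

    def run_batch():
        # Idempotency check: if a previous attempt already finished, a retry
        # (scheduler-driven or manual) is a no-op instead of double-processing.
        if os.path.exists(DONE_MARKER):
            return

        rows = do_the_actual_work()

        # Write to a temp name and rename into place, then drop the marker
        # last: a crash mid-run leaves no marker, so the retry simply redoes
        # the work from scratch.
        with open(OUTPUT + ".tmp", "w") as f:
            f.writelines(rows)
        os.rename(OUTPUT + ".tmp", OUTPUT)
        open(DONE_MARKER, "w").close()

    if __name__ == "__main__":
        run_batch()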
How do you deal with sidecar containers in CronJobs (and regular batch Jobs) not terminating correctly?

https://github.com/kubernetes/kubernetes/issues/25908
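Until that issue is fixed upstream, the workaround I've usually seen is an explicit handshake over a shared emptyDir: the main container touches a sentinel file as its last step, and the sidecar's entrypoint watches for it and shuts the sidecar down so the Job pod can actually complete. A rough sketch of such a sidecar wrapper; the sentinel path and the wrapped command are made up:

    import os
    import subprocess
    import time

    SENTINEL = "/shared/main-done"          # path on a shared emptyDir (made up)
    SIDECAR_CMD = ["cloud_sql_proxy", "-dir=/cloudsql"]  # whatever the sidecar runs

    def main():
        proxy = subprocess.Popen(SIDECAR_CMD)

        # Wait for the sentinel file the main container writes when it finishes.
        while not os.path.exists(SENTINEL):
            if proxy.poll() is not None:
                # The wrapped process died on its own; propagate its exit code.
                raise SystemExit(proxy.returncode)
            time.sleep(2)

        # Main container is done: stop the sidecar so the pod can reach a
        # terminal state instead of running forever.
        proxy.terminate()
        proxy.wait()

    if __name__ == "__main__":
        main()

The main container's last command is then just "touch /shared/main-done". It's ugly, but it keeps CronJob pods from hanging around forever with one container still running.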