Much as it burns me to admit this, for this use case, Jenkins is king. Under 60 nodes, it's perfect.

At a previous job, we migrated from a nasty cron orchestration system to Jenkins. It did a number of things, including building software, batch-generating thumbnails, and moving data about on around 30 nodes, of which about 25 were fungible.

Jenkins Job Builder meant that everything was defined in YAML, stored in git, and repeatable. A sane user environment meant that we could execute as a user and inherit their environment. It has sensible retry logic, and lots of hooks for all your hooking needs. Pipelines are useful for chaining jobs together.

We _could_ have written them as normal jobs to be run somewhere in the 36k-node farm, but that was more hassle than it's worth. Sure, it's fun, but we'd have had to contend with sharing a box that's doing a fluid sim or similar, so we'd have had to carve off a section anyway.

However, Kubernetes to _just_ run cron is a massive waste. It smacks of shiny-new-tool syndrome. Seriously, Jenkins is a single-day deployment. Transplanting the cron jobs is again less than a day (assuming your slaves have a decent environment).

So, with the greatest of respect, talking about building a business case is pretty moot when you are effectively wasting what appears to be more than two man-months on what should be a week-long migration. Think gaffer tape, not carbon fibre bonded to aluminium.

If, however, the rest of the platform lives on Kubernetes, then I could see the logic; having all your stuff running on one platform is very appealing, especially if you have invested time in translating comprehensive monitoring into business-relevant alerts.
I always search for mentions of HashiCorp Nomad in the comments section of front-page Kubernetes articles like this. There are often few or no mentions, so I’d like to add a plug for the HashiStack.

For some reason Nomad seems to get noticeably less publicity than some of the other HashiCorp offerings like Consul, Vault, and Terraform. In my opinion Nomad is right up there with them. The documentation is excellent. I haven’t had to fix any upstream issues in about a year of development on two separate Nomad clusters. Upgrading versions live is straightforward, and I rarely find myself in a situation where I can’t accomplish something I envisioned because Nomad is missing a feature. It schedules batch jobs, cron jobs, long-running services, and system services that run on every node. It has a variety of job drivers outside of Docker.

Nomad, Consul, Vault, and the Consul-aware Fabio load balancer run together to form most of what one might need for a cluster-scheduler-based deployment, somewhat reminiscent of the “do one thing well” Unix philosophy of composability.

Certainly it isn’t perfect, but I’d recommend it to anyone who is considering using a cluster scheduler but is apprehensive about the operational complexity of the more widely discussed options such as Kubernetes.
Setting aside the k8s content itself, I love the way this article is written. It's not a typical tutorial or tips-and-tricks post; it takes you time-traveling through the experience of a big company adopting nascent tech. Lots of great things to take away, even outside of the Kubernetes tips.
> “Sometimes when we do an etcd failover, the API server starts timing out requests until we restart it.”

This is likely related to a set of Kubernetes bugs [1][2] (and a gRPC one [3]) that CoreOS is working diligently to get fixed. The first of these, the endpoint reconciler [4], has landed in 1.9.

More work is pending on the etcd client in Kubernetes. The good news is that the client is used everywhere, so one fix and all components will benefit.

[1]: https://github.com/kubernetes/community/pull/939
[2]: https://github.com/kubernetes/kubernetes/issues/22609
[3]: https://github.com/kubernetes/kubernetes/issues/47131
[4]: https://github.com/kubernetes/kubernetes/pull/51698
I'm curious about what people think about HashiCorp's Nomad vs Kubernetes.

I chose Nomad because I'm already using Consul and I wanted to run raw .NET executables. Would it have been worth it to use Docker with .NET Core?

Not trying to change my infrastructure now, but just curious about whether it is worth the time to play with it on the side.
I haven't been at a k8s shop yet, but at my last job we used Marathon (on DC/OS). I know you can run Kubernetes on DC/OS, but the default scheduler it comes with is Marathon.

Is there an advantage to one over the other? It looks like in both cases you need a platform team (at least 2, maybe 3 people; we had a large, complex setup and had like 10) to set up things like K8s, DC/OS or Nomad, because they are complex systems with a lot of different components ...
Components like Flannel vs Weave Net vs some other container network, handling storage volumes, labels and automatic configuration of HAProxy from them (marathon-lb on DC/OS).

All schedulers (k8s, Swarm, Marathon) seem to use a JSON format for job information that's pretty specific, not only to the scheduler, but to the way other tooling is set up at your specific shop.
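For anyone who hasn't worked with Marathon, here's roughly what that scheduler-specific JSON looks like when you submit an app to its REST API. This is a sketch from memory of the v2 API; the Marathon address, app id, and image are made up:

    import requests

    MARATHON = "http://marathon.example.com:8080"  # placeholder address

    # A Marathon "app" definition: plain JSON, but fields like cpus/mem/instances
    # and the container block are Marathon-specific and don't translate directly
    # to a k8s Deployment or a Nomad job.
    app = {
        "id": "/batch/thumbnailer",
        "cmd": "python generate_thumbnails.py",
        "cpus": 0.5,
        "mem": 256,
        "instances": 2,
        "container": {
            "type": "DOCKER",
            "docker": {"image": "registry.example.com/thumbnailer:1.2"},
        },
    }

    resp = requests.post(MARATHON + "/v2/apps", json=app)
    resp.raise_for_status()

The equivalent object in k8s or Nomad is structured completely differently, which is why these definitions (and the tooling that generates them) end up so coupled to whatever your shop runs.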
Why do you need a 99.99% job completion rate? Why not just design for failure and inevitable retries? It almost seems like you grant platform users a false sense of security by making it very reliable but not perfect.
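Agreed. If every job is written to be idempotent, a retry (whether the scheduler's or a human's) is boring rather than scary. A minimal sketch of the pattern; the paths and the work function are made up:

    import os

    OUTPUT = "/data/output/2017-12-01.csv"   # hypothetical output path
    DONE_MARKER = OUTPUT + ".done"           # hypothetical per-run marker

    def do_the_actual_work():
        # Stand-in for the real job; returns the lines to persist.
        return ["some,result\n"]

    def run_batch():
        # Idempotency check: if a previous attempt already finished, a retry
        # (scheduler-driven or manual) is a no-op instead of double-processing.
        if os.path.exists(DONE_MARKER):
            return

        rows = do_the_actual_work()

        # Write to a temp name and rename into place, then drop the marker
        # last: a crash mid-run leaves no marker, so the retry simply redoes
        # the work from scratch.
        with open(OUTPUT + ".tmp", "w") as f:
            f.writelines(rows)
        os.rename(OUTPUT + ".tmp", OUTPUT)
        open(DONE_MARKER, "w").close()

    if __name__ == "__main__":
        run_batch()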
How do you deal with sidecar containers in CronJobs (and regular batch Jobs) not terminating correctly?

https://github.com/kubernetes/kubernetes/issues/25908
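Until that issue is fixed upstream, the workaround I've usually seen is an explicit handshake over a shared emptyDir: the main container touches a sentinel file as its last step, and the sidecar's entrypoint watches for it and shuts the sidecar down so the Job pod can actually complete. A rough sketch of such a sidecar wrapper; the sentinel path and the wrapped command are made up:

    import os
    import subprocess
    import time

    SENTINEL = "/shared/main-done"          # path on a shared emptyDir (made up)
    SIDECAR_CMD = ["cloud_sql_proxy", "-dir=/cloudsql"]  # whatever the sidecar runs

    def main():
        proxy = subprocess.Popen(SIDECAR_CMD)

        # Wait for the sentinel file the main container writes when it finishes.
        while not os.path.exists(SENTINEL):
            if proxy.poll() is not None:
                # The wrapped process died on its own; propagate its exit code.
                raise SystemExit(proxy.returncode)
            time.sleep(2)

        # Main container is done: stop the sidecar so the pod can reach a
        # terminal state instead of running forever.
        proxy.terminate()
        proxy.wait()

    if __name__ == "__main__":
        main()

The main container's last command is then just "touch /shared/main-done". It's ugly, but it keeps CronJob pods from hanging around forever with one container still running.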