One important point that the author seems to have misunderstood is that Borg was the <i>predecessor</i> to the other two systems, not the successor. Borg went into production (running a bunch of websearch dedicated clusters) in late 2004, long before Mesos or Omega were around. Omega is/was an experimental replacement for Borg that was started much later, although I'm not sure how much production load it actually took over.
See also Google's blog post summarizing Borg -> Kubernetes improvements.<p><a href="http://blog.kubernetes.io/2015/04/borg-predecessor-to-kubernetes.html" rel="nofollow">http://blog.kubernetes.io/2015/04/borg-predecessor-to-kubern...</a>
It's interesting to see how other industries tackle the same problem.<p>VFX has essentially the same problem as Google: a huge bunch of tasks that all need to run at once.<p>However, VFX studios tend to have only one data center, so they don't need or want a clustered scheduler.<p><a href="https://github.com/mikrosimage/openrendermanagement" rel="nofollow">https://github.com/mikrosimage/openrendermanagement</a>, Alfred and Tractor from Pixar, and Framestore's FQ (which is faster and more efficient than Borg at job dispatch) are a few good examples of task management.
I know a lot about Mesos and Mesosphere's DCOS, so I can comment on those:<p>* There are users of these systems that get 90%+ cluster utilization.<p>* Preemptible tasks (e.g., best-effort scheduling vs. guaranteed-SLA scheduling) will be landing in Mesos.<p>* Mesosphere is building advanced scheduling plug-ins that will use the new scheduling models to oversubscribe a cluster, helping to drive utilization to the 90%+ range without the need for any special tooling. You can get an idea of some of the algorithms being employed by checking out the Kozyrakis/Delimitrou Quasar paper[1].<p>[1] <a href="http://csl.stanford.edu/~christos/publications/2014.quasar.asplos.pdf" rel="nofollow">http://csl.stanford.edu/~christos/publications/2014.quasar.a...</a>
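To make the best-effort vs. guaranteed idea above a bit more concrete, here is a toy sketch of preemption on a single node. It is not Mesos or DCOS code and does not use any of their APIs; the `Task` and `Node` classes, the `place` method, and the "evict the largest best-effort task first" policy are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cpus: float
    preemptible: bool  # True = best effort, False = guaranteed SLA (hypothetical flag)


@dataclass
class Node:
    capacity: float
    running: list = field(default_factory=list)

    def used(self) -> float:
        return sum(t.cpus for t in self.running)

    def place(self, task: Task) -> list:
        """Place `task` and return the tasks evicted to make room.

        Raises RuntimeError (leaving the node unchanged) if the task cannot fit.
        """
        free = self.capacity - self.used()
        evicted = []
        if task.cpus > free and not task.preemptible:
            # Only guaranteed tasks may preempt. Assumed policy: pick
            # best-effort victims, largest first, until the new task fits.
            victims = sorted(
                (t for t in self.running if t.preemptible),
                key=lambda t: t.cpus,
                reverse=True,
            )
            for victim in victims:
                if task.cpus <= free:
                    break
                evicted.append(victim)
                free += victim.cpus
        if task.cpus > free:
            raise RuntimeError(f"cannot place {task.name}")
        # Commit the evictions only once we know the task fits.
        for victim in evicted:
            self.running.remove(victim)
        self.running.append(task)
        return evicted


node = Node(capacity=8)
node.place(Task("web", 4, preemptible=False))
node.place(Task("batch-1", 3, preemptible=True))
node.place(Task("batch-2", 1, preemptible=True))   # node is now fully used
evicted = node.place(Task("db", 2, preemptible=False))
print([t.name for t in evicted], node.used())
```

The point of the sketch is only that best-effort work can fill the slack left by guaranteed work and gets pushed out when a guaranteed task needs the room. A real scheduler also has to account for memory, measured usage vs. reservations, rescheduling of the evicted tasks, and so on.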
Is anyone using these at scale but with a small team to support it? We have a fleet of 5-6k servers across 3 DCs plus another 1.5k in AWS. I tried deploying Mesos with mixed results. I also experimented with CoreOS. Considering re-exploring Xen/VMware.
I'm not a sysadmin but recently started using CoreOS to deploy small web apps. Could anyone explain to me like I'm 5 what's the difference between those cluster schedulers and something like CoreOS' fleet (<a href="https://github.com/coreos/fleet" rel="nofollow">https://github.com/coreos/fleet</a>)?