An interesting read, but it doesn't look like there's too much exciting or novel here. (In a fundamental sense, that is. I'm sure there's all kinds of interesting nuts-and-bolts engineering that outsiders aren't privy to.) TLDR: use a replicated state machine to make scheduling decisions, and make all operations on the datacenter idempotent.<p>The hashing trick to mitigate spiky load distributions is cool, but that seems to be more about multi-tenancy than reliability.<p>I'm disappointed to see this article perpetuating the misconception that Paxos is a leader election algorithm. It <i>tries</i> to elect a leader for its own purposes, but Paxos itself behaves safely even if the election process goes temporarily amok; other systems built on top of it might not. If you want to provide the guarantee that only one scheduler instance is running at a time, you need to add a lease mechanism and make assumptions about clock synchrony. I'm sure the authors know this, but not mentioning it at all seems pretty sloppy.
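To make the lease point concrete, here is a rough sketch of the kind of mechanism I mean. It is not anything from the article. The `Lease` class, its method names, and the `max_clock_drift` parameter are all made up for illustration. It also keeps the lease in memory with a single clock, whereas a real system would store it in the replicated state machine and have grantor and holder read separate clocks.

```python
import time


class Lease:
    """Toy in-memory lease (hypothetical; for illustration only).

    Assumption: grantor and holder clocks differ by at most
    `max_clock_drift` seconds. Without a bound like that, a lease
    cannot guarantee a single active scheduler.
    """

    def __init__(self, duration, max_clock_drift, clock=time.monotonic):
        self.duration = duration
        self.max_clock_drift = max_clock_drift
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node):
        """Grant or renew the lease; refuse while another node holds it."""
        now = self.clock()
        if self.holder is None or self.holder == node or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.duration
            return True
        return False

    def safe_to_act(self, node):
        """The holder stops acting early, by the assumed drift margin."""
        deadline = self.expires_at - self.max_clock_drift
        return self.holder == node and self.clock() < deadline
```

A scheduler instance would call `safe_to_act` before launching each job and renew with `try_acquire` well before the deadline. The drift margin is where the clock synchrony assumption shows up: if the real drift exceeds it, two instances can both believe they hold the lease.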
I wonder if those guys checked out <a href="https://github.com/mesos/chronos" rel="nofollow">https://github.com/mesos/chronos</a>. It was the best solution I could find when I recently needed a distributed, reliable cron for us.
We had a similar issue, although at a different scale.<p>Jenkins works well as a cron replacement, though I'm not sure what the limit is on the number of build slaves you can attach to Jenkins.