Interesting stuff as usual, Netflix, but what's really excellent for the community is that you identify and report the issues you discover under these conditions. Kudos to you all.
I notice they're using Spark standalone clusters.<p>I've had problems with executors dying when running under YARN, and YARN makes keeping track of running jobs and finding log output more difficult. So unless you really need to run on the same instances as other MR tech, standalone seems to be the way to go.<p>If only AWS provided EMR AMIs for Spark standalone clusters, I could switch...
I would like to know which language they are using with Spark: Python, Java, or Scala. I have experience with both Python and Java, and while Python works very well for prototyping new Spark flows, once a flow becomes too complex I fall back to Java because of the generics in RDDs.
I'm curious how certain services can survive Chaos Monkey. Memcached is one example; if you start destroying instances, you're going to stampede your persistent datastore to get that memcached replacement hot again.
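One common mitigation for the stampede described above is to coalesce concurrent misses, so that a cold cache sends only one read per key to the persistent datastore. The following is a minimal in-process sketch of that idea, not how Netflix or memcached handles it; `SingleFlightCache` and `loader` are hypothetical names, and a plain dict stands in for the cache node.

```python
import threading


class SingleFlightCache:
    """Hypothetical in-process cache that coalesces concurrent misses per key."""

    def __init__(self, loader):
        # loader is an assumed callable that reads one key from the
        # persistent datastore (the expensive path we want to protect).
        self._loader = loader
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
            # One lock per key, so misses on different keys do not block each other.
            key_lock = self._key_locks.setdefault(key, threading.Lock())

        with key_lock:
            # Re-check: another thread may have filled the key while we waited.
            with self._guard:
                if key in self._values:
                    return self._values[key]
            value = self._loader(key)  # only one thread per key reaches this
            with self._guard:
                self._values[key] = value
            return value
```

With this in place, many simultaneous requests for the same cold key trigger a single datastore read; the rest wait and reuse the result. A real deployment would also need expiry and coordination across processes, which this sketch leaves out.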
Interesting title, given that Netflix has just released pricing for their upcoming New Zealand launch that undercuts the local offering from Spark (formerly Telecom).