Yes, case number 384,449,194 of systems management causing a system problem. Also case number 439,224 of what looked like a localized problem quickly crashing a huge system, e.g., all 23 data centers around the world.

They have my sympathy: So, they typed in a 'rule'. At one time I was working in 'artificial intelligence' (AI), actually 'expert systems', based on using 'rules' to implement real-time management of server farms and networks. Of course, in that work the goals included very high reliability and 'lights out' data centers, that is, data centers that don't need people walking around doing manual work (not 'lights out' in the sense of the CloudFlare outage).

Looking into reliability, that is, sorting the causes of outages into a few broad categories, one category causing a large fraction of the outages was humans doing system management, or, in the words of the HAL 9000, "human error". Yup.

And the whole thing went down? Yup: One example we worked with was system management of a 'cluster'. Well, one of the computers in the cluster "went a little funny, a little funny in the head" and was throwing all its incoming work into its 'bit bucket'. So, the CPU busy metric on that computer was not very high, the load leveling started sending nearly all the work to that one computer, thus into its bit bucket, and thus effectively killed the work of the whole cluster. (The first sketch at the end of this comment illustrates that feedback loop.)

As one response I decided that real-time monitoring of a cluster, or any system that is supposed to be 'well balanced' via some version of 'load leveling', should include looking for 'out of balance' situations.

So, let's see: Such monitoring can have false positives (false alarms) and false negatives (missed detections). So, such monitoring is necessarily a case of statistical hypothesis testing, typically with the 'null hypothesis' that the system is healthy, applied continually in near real-time. Next, for monitoring 'balancing', we will likely have to work with multi-dimensional data, and our chances of knowing the probability distribution of that data, even for a healthy system, are slim to none. So we need a statistical hypothesis test that is both multi-dimensional and distribution-free. (The second sketch at the end of this comment shows one such test.)

So, CloudFlare's problems are not really new!

I went ahead and did some work, math, prototype software, etc., and maybe someday it will be useful, but it wouldn't have helped CloudFlare here, if only because they needed no help noticing that all their systems around the world were crashing.

In our work on AI, at times we visited some high-end sites, and in some cases we found extreme, off-the-top-of-the-charts concern and discipline about who could take any system management actions, on what, and why. E.g., they had learned the lesson that you can't let someone just type a new rule into a production system. Why? Because, as it was explained, one outage in a year and the CIO would lose his bonus; two outages and he would lose his job. Net, we're talking very high concern. No doubt CloudFlare will install lots of discipline around humans taking system management actions on their production systems.

Net, I can't blame CloudFlare. If my business gets big enough to need their services, they will be high on the list of companies I will call first!
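
First sketch: a toy illustration of the bit-bucket failure mode. The scheduler logic, node names, and numbers here are my own assumptions, purely for illustration, not anyone's real load leveler. The point it shows: if you dispatch to the node reporting the lowest CPU busy, a sick node that silently discards its work always looks idle and so wins nearly every dispatch.

    # Toy model: least-CPU-busy dispatch with one node silently
    # black-holing its work.  All details here are illustrative
    # assumptions, not real load-leveler code.

    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        cpu_busy: float = 0.0   # fraction 0..1, decays between requests
        completed: int = 0
        healthy: bool = True    # an unhealthy node discards its work

        def handle(self, cost: float) -> None:
            if self.healthy:
                self.cpu_busy = min(1.0, self.cpu_busy + cost)
                self.completed += 1
            # else: the request goes to the bit bucket and cpu_busy
            # never rises, so this node keeps looking "idle"

    def least_busy_dispatch(nodes, n_requests=10_000, cost=0.01, decay=0.995):
        for _ in range(n_requests):
            for n in nodes:
                n.cpu_busy *= decay          # work drains off over time
            # route each request to the node reporting the lowest CPU busy
            min(nodes, key=lambda n: n.cpu_busy).handle(cost)
        return nodes

    nodes = [Node("a"), Node("b"), Node("c", healthy=False)]
    for n in least_busy_dispatch(nodes):
        print(n.name, "completed:", n.completed, "cpu_busy: %.4f" % n.cpu_busy)

In the run above, the healthy nodes each complete one request, after which the sick node "c" reports the lowest CPU busy forever and wins every remaining dispatch, i.e., the useful throughput of the whole cluster collapses, exactly as in the story above.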
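Second sketch: a minimal version of the multi-dimensional, distribution-free monitoring just described. To be clear, this is not the actual math of my old prototype; it is one standard way to get the stated properties, a permutation test, whose only assumption is that under the null hypothesis (a healthy, balanced cluster) the per-sample metric vectors are exchangeable across nodes. The function name, metrics, and thresholds are assumptions for the sketch.

    # Distribution-free, multi-dimensional "out of balance" detector
    # via a permutation test.  Null hypothesis: the cluster is healthy,
    # so metric vectors are exchangeable across nodes.  Statistic: the
    # largest distance of any node's mean metric vector from the grand
    # mean.  The null distribution of the statistic is estimated by
    # shuffling node labels, so no parametric assumptions are needed.

    import numpy as np

    def balance_alarm(samples: np.ndarray, alpha: float = 0.001,
                      n_perm: int = 2000, rng=None) -> bool:
        """samples: shape (n_nodes, n_obs, n_metrics), e.g. per-node
        windows of (cpu_busy, queue_depth, completions_per_sec).
        Returns True if the observed imbalance exceeds what label
        shuffling can explain at false-alarm rate alpha."""
        rng = np.random.default_rng(rng)
        n_nodes, n_obs, n_metrics = samples.shape

        def statistic(x):
            node_means = x.mean(axis=1)            # (n_nodes, n_metrics)
            grand_mean = node_means.mean(axis=0)   # (n_metrics,)
            return np.max(np.linalg.norm(node_means - grand_mean, axis=1))

        observed = statistic(samples)

        # Build the null distribution by permuting observations across nodes.
        pooled = samples.reshape(n_nodes * n_obs, n_metrics)
        exceed = 0
        for _ in range(n_perm):
            perm = rng.permutation(pooled).reshape(n_nodes, n_obs, n_metrics)
            if statistic(perm) >= observed:
                exceed += 1
        p_value = (exceed + 1) / (n_perm + 1)      # small-sample correction
        return p_value <= alpha

    # usage, with metrics gathered elsewhere:
    #   alarm = balance_alarm(window)  # window.shape == (n_nodes, n_obs, n_metrics)

The attraction of this kind of test is that alpha directly controls the false-alarm rate (the false positives above) with no knowledge of the underlying distribution; the missed-detection rate then depends on the window size and on how large an imbalance you need to catch.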