In a fleet of 100,000 machines, there will always be some clear failures. When one machine has 2x the segfault count of any other machine in the fleet, you send it for repairs and someone replaces the motherboard, RAM, and CPU... easy!<p>The painful ones are the 'subtle' failures. Why does machine PABL12 sometimes return NaN while the other 99,999 machines return sensible numbers? And yet all the burn-in hardware tests pass...<p>The solution was to simply exclude any machine that was an outlier: anything in the top or bottom 0.01% for any metric got banned from future workloads.<p>Sure, in most cases there was nothing wrong with the hardware. But when you're spending hours debugging a fault caused by a sometimes-bad floating-point unit on one core of one machine out of 100,000, you're just wasting your time. By auto-banning outliers, the machine ends up doing some other task where data consistency matters less.
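<p>The banning rule above can be sketched in a few lines. This is my own illustration, not the actual fleet tooling: the data shape (metric name mapped to per-machine values) and the function name are assumptions, and the 0.01% tail matches the figure quoted above.

```python
def outlier_machines(metrics, tail=0.0001):
    """Return machine ids in the top or bottom `tail` fraction
    for any metric. `metrics` maps metric name -> {machine_id: value}.
    (Hypothetical helper; data shape is an assumption.)"""
    banned = set()
    for name, per_machine in metrics.items():
        # Rank machines by this metric's value, lowest first.
        ranked = sorted(per_machine, key=per_machine.get)
        # Ban at least one machine per tail even in small fleets.
        k = max(1, int(len(ranked) * tail))
        banned.update(ranked[:k])    # bottom tail
        banned.update(ranked[-k:])   # top tail
    return banned


# Toy fleet of 100 machines; with a 1% tail, the single lowest and
# highest segfault counts get banned.
fleet = {"segfaults": {f"m{i}": float(i) for i in range(100)}}
print(sorted(outlier_machines(fleet, tail=0.01)))  # → ['m0', 'm99']
```

One side effect worth noting: a machine banned for being *best* on a metric (bottom 0.01% of segfaults) is also excluded, which is exactly the point of the rule above, since "suspiciously good" can be as anomalous as "bad".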