The scale at which Meta operates really boggles my mind. I work with an ex-Facebook guy who was on the infra side of things, and the numbers he told me... I couldn't even imagine. I'm working on the order of magnitude of 100m/h, but that's still a completely different set of challenges.
In a fleet of 100,000 machines, there will always be some clear failures. When a machine has 2x the segfaults of any other machine in the fleet, you send it for repairs and someone replaces the motherboard, RAM and CPU... easy!

But the painful ones are the 'subtle' failures. Why does machine PABL12 sometimes return NaN while the other 99,999 machines return sensible numbers? Yet all the hardware burn-in tests pass...

The solution was simply to exclude any machines that were outliers: anything in the top or bottom 0.01% for any metric gets excluded from future workloads.

Sure, in most cases there was nothing wrong with the hardware, but when you're spending hours debugging a fault caused by a sometimes-bad floating point unit on one core of one machine out of 100,000, you're just wasting your time. By auto-banning outliers, the machine ends up doing some other task where data consistency matters less.
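The auto-ban heuristic described above could be sketched roughly like this (the function name, metric names, and data layout are illustrative; only the top/bottom 0.01% cutoff comes from the comment):

```python
from typing import Dict, Set

def outlier_machines(metrics: Dict[str, Dict[str, float]],
                     tail: float = 0.0001) -> Set[str]:
    """Return machines in the top or bottom `tail` fraction for ANY metric.

    `metrics` maps metric name -> {machine_id: value}.
    A machine only has to be extreme on one metric to get banned.
    """
    banned: Set[str] = set()
    for values in metrics.values():
        ranked = sorted(values, key=values.get)  # machine ids, low to high
        k = max(1, int(len(ranked) * tail))      # at least one machine per tail
        banned.update(ranked[:k])                # bottom outliers
        banned.update(ranked[-k:])               # top outliers
    return banned

# Toy example: one machine out of 10,000 segfaults far more than its peers.
fleet = {"segfaults_per_day": {f"m{i}": 1.0 for i in range(9_999)}}
fleet["segfaults_per_day"]["m_bad"] = 50.0
banned = outlier_machines(fleet)  # contains "m_bad" (plus a bottom-tail machine)
```

Note the trade-off the comment accepts: the bottom tail gets banned too, so a perfectly healthy machine can be pulled from the workload. That is by design; at this scale, occasionally sidelining a good machine is cheaper than debugging a subtly bad one.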
Some might enjoy this old Cloudflare debugging story about random crashes in production.

<a href="https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/" rel="nofollow">https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/</a>
To be clear, this is about corruption in the CPU/GPU/memory complex. There's a whole separate set of techniques (some of which I worked on) to detect and correct data corruption on disk.
Interestingly, this site fails ungracefully (HTTP 500) when I try to visit from NordVPN, even after cycling through a few IP addresses. I’m noticing more and more sites block all VPN traffic. I get why, but it’s not good.
I work on the physical side: building hyperscale datacenters. You guys should try your hand at managing errors in that system. You've got it all: memory leaks, thermal overloads, misallocated heaps, pipes with strong type requirements, dropped packets... you name it.
Completely off-topic digression: I still think the name change to “Meta” was a big mistake. Subjectively, for some reason I just really dislike the name. More objectively, the branding is very muddled, e.g. serving an “Engineering at Meta” blog post on fb.com.

Often with these things it’s just about time; it feels wrong because you’re not used to the change yet. Maybe that will happen, but it’s been months now, and usually with these changes I come around quicker than that.
Computational proofs of integrity (STARKs, SNARKs) could detect silent data corruptions, at the cost of a ~1000x slowdown.

I wonder if we’ll see them used for large-scale applications whose correctness is critical.
It would be better if Meta focused on detecting spam at scale.

I put a desk chair on Marketplace last Friday and got 8 messages that were actually scams. These tried to "schedule" a FedEx/DHL pickup and redirected me to fake branded websites requesting my personal details and bank account information. It was so obviously fake that it baffles me Meta can't detect these automatically.

I am also getting multiple message requests per week asking for hookups. These are obviously fake [1].

---

[1] <a href="https://imgur.com/a/yZDPh3C" rel="nofollow">https://imgur.com/a/yZDPh3C</a>