In a fleet of 100,000 machines, there will always be some clear failures. When one machine has 2x the segfault count of any other machine in the fleet, you send it for repairs and someone replaces the motherboard, RAM, and CPU... easy!<p>The painful ones are the 'subtle' failures. Why does machine PABL12 sometimes return NaN while the other 99,999 machines return sensible numbers? And yet all the burn-in hardware tests pass...<p>The solution was to simply exclude any machine that was an outlier: anything in the top or bottom 0.01% for any metric got banned from future workloads.<p>Sure, in most cases there was nothing wrong with the hardware. But when you're spending hours debugging a fault caused by a sometimes-bad floating-point unit on one core of one machine out of 100,000, you're just wasting your time. By auto-banning outliers, the machine ends up doing some other task where data consistency matters less.
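<p>The banning rule above can be sketched in a few lines. This is my own illustration, not the actual fleet tooling: the data shape (metric name mapped to per-machine values) and the function name are assumptions, and the 0.01% tail matches the figure quoted above.

```python
def outlier_machines(metrics, tail=0.0001):
    """Return machine ids in the top or bottom `tail` fraction
    for any metric. `metrics` maps metric name -> {machine_id: value}.
    (Hypothetical helper; data shape is an assumption.)"""
    banned = set()
    for name, per_machine in metrics.items():
        # Rank machines by this metric's value, lowest first.
        ranked = sorted(per_machine, key=per_machine.get)
        # Ban at least one machine per tail even in small fleets.
        k = max(1, int(len(ranked) * tail))
        banned.update(ranked[:k])    # bottom tail
        banned.update(ranked[-k:])   # top tail
    return banned


# Toy fleet of 100 machines; with a 1% tail, the single lowest and
# highest segfault counts get banned.
fleet = {"segfaults": {f"m{i}": float(i) for i in range(100)}}
print(sorted(outlier_machines(fleet, tail=0.01)))  # → ['m0', 'm99']
```

One side effect worth noting: a machine banned for being *best* on a metric (bottom 0.01% of segfaults) is also excluded, which is exactly the point of the rule above, since "suspiciously good" can be as anomalous as "bad".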