Silent data corruptions at scale (2021)

84 points by losfair, over 1 year ago

7 comments

userbinator, over 1 year ago
Very interesting topic, but rather low on detail. I really wanted to see what those 60 lines of asm that allegedly show a faulty CPU instruction were, and I'm also surprised that the fault wasn't intermittent. In my experience, CPU problems usually are intermittent and heavily dependent upon prior state, and manually stepping through with a debugger has never shown the "1+1=3" type of situation they claim. That said, I wonder if running LINPACK would have found it. LINPACK is known to be a very powerful stress test, though opinions on it are divided in the overclocking community: some, including me, claim that a system can never be considered stable if it fails LINPACK, since that is essentially showing intermittent "1+1=3" behaviour, while others are fine with "occasional" discrepancies in its output since the system otherwise appears to be stable.
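To illustrate the principle behind that kind of stress test, here is a minimal sketch, not LINPACK itself: run the same deterministic floating-point workload several times and compare the results bit for bit. The function names `workload` and `find_discrepancies` and the arithmetic inside the loop are assumptions made up for this example; a real stress test exercises far more of the CPU.

```python
import hashlib
import struct


def workload(n: int = 2000) -> bytes:
    """Run a deterministic floating-point loop and return a digest of every step.

    The arithmetic is arbitrary (hypothetical); it only needs to be repeatable.
    """
    digest = hashlib.sha256()
    x = 1.0
    for i in range(1, n + 1):
        x = (x * 1.0000001 + i ** 0.5) % 1e6
        # Hash the exact bit pattern of each intermediate value.
        digest.update(struct.pack("<d", x))
    return digest.digest()


def find_discrepancies(runs: int = 5) -> list[int]:
    """Return the indices of runs whose digest differs from a reference run."""
    reference = workload()
    return [i for i in range(runs) if workload() != reference]


if __name__ == "__main__":
    bad_runs = find_discrepancies()
    print("discrepant runs:", bad_runs)
```

On healthy hardware the list should be empty. Any non-empty result is the "1+1=3" situation described above, although a check this small would likely miss a defect that only shows up under specific instructions, data, or thermal conditions.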
dang, over 1 year ago
Related:
Meta quickly detects silent data corruptions at scale - https://news.ycombinator.com/item?id=30905636 - April 2022 (95 comments)
Silent Data Corruptions at Scale - https://news.ycombinator.com/item?id=27484866 - June 2021 (12 comments)
dataflow, over 1 year ago
Google also had a "Cores That Don't Count" paper on so-called "mercurial cores" (https://news.ycombinator.com/item?id=27378624), as well as a presentation: https://www.youtube.com/watch?v=QMF3rqhjYuM
ekelsen, over 1 year ago
I wrote an article about these affecting LLM training at https://www.adept.ai/blog/sherlock-sdc
opisthenar84, over 1 year ago
Might be a noob question, but for truly important data, couldn't SDCs be detected by using ECC everywhere?
twhitmore, over 1 year ago
Interesting. The corruption was in a math.pow() calculation, representing a compressed file size prior to a file decompression step.
Compressing data, with the increased information density and the greater number of CPU instructions involved, seems obviously to increase the exposure to corruption and bit flips.
What I did wonder was why compress the file size as an exponent? One would imagine that representing it as a floating-point exponent would take lots of cycles, need pretty much as many bits, and have nasty precision inaccuracies at larger sizes.
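As a rough illustration of how a size computed via `math.pow()` could be cross-checked, here is a minimal sketch, assuming the exponent is a non-negative integer. It computes the power a second way (exponentiation by squaring) and refuses to proceed if the two results disagree. The helper names `pow_by_squaring` and `checked_pow`, the tolerance, and the sample values 1.1 and 53 are illustrative assumptions, not taken from the thread or from the code the paper examined.

```python
import math


def pow_by_squaring(base: float, exponent: int) -> float:
    """Compute base**exponent for a non-negative integer exponent by squaring."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1.0
    factor = base
    remaining = exponent
    while remaining > 0:
        if remaining & 1:
            result *= factor
        factor *= factor
        remaining >>= 1
    return result


def checked_pow(base: float, exponent: int, rel_tol: float = 1e-9) -> float:
    """Return math.pow(base, exponent) after cross-checking it against a second path."""
    primary = math.pow(base, exponent)
    secondary = pow_by_squaring(base, exponent)
    if not math.isclose(primary, secondary, rel_tol=rel_tol):
        raise ArithmeticError(
            f"pow mismatch: math.pow gave {primary!r}, squaring gave {secondary!r}"
        )
    return primary


if __name__ == "__main__":
    # Hypothetical use: derive an integer size from a stored exponent.
    size = int(checked_pow(1.1, 53))
    print(size)
```

A check like this only helps when the two paths exercise different instructions. If both run on the same defective core and hit the same faulty unit, they could agree on a wrong answer, so running the second path on a different core or machine would be the stronger design.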
SomeoneFromCA, over 1 year ago
Interesting paper, but it has some technical errors. First of all, they keep mentioning SRAM+ECC instead of DRAM+ECC. Second, you cannot use gcj to inspect the assembly code generated for a Java method, as it will be completely different from the code generated by HotSpot. Third, you do not need all those acrobatics to get a disassembly of the method: you could just add an infinite loop to the code, attach gdb to the JVM process, and inspect the code or dump the core.