Imaging a Hard Drive with Non-ECC Memory – What Could Go Wrong?

169 points by robertelder, about 2 years ago

15 comments

hayst4ck, about 2 years ago
Bit flips are totally real; at scale you will definitely see them on large queries. There was a fun talk at DEFCON on bitsquatting: registering domain names that are one bit off from popular ones and then accepting all incoming connections. Attacks like rowhammer similarly abuse bit flips. Supposedly Microsoft can detect solar activity based on the number of Windows crash logs it receives.

DEFCON talk: https://www.youtube.com/watch?v=aT7mnSstKGs

https://en.wikipedia.org/wiki/Bitsquatting

https://en.wikipedia.org/wiki/Row_hammer
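
To make the bitsquatting idea concrete, here is a minimal Python sketch that lists the labels one flipped bit away from a given domain label. The function name `bitsquat_candidates` and the filtering rules are illustrative assumptions, not taken from the talk; it only keeps results that still look like legal hostname labels.

```python
import string

# Characters allowed in a DNS hostname label (letters, digits, hyphen).
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_candidates(label: str) -> list[str]:
    """Return every label that differs from `label` by exactly one flipped
    bit and still consists only of hostname-legal characters.

    Hypothetical helper for illustration; assumes an ASCII label.
    """
    label = label.lower()  # DNS names are case-insensitive
    candidates = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            # Flip one bit of this character; case changes are ignored
            # because they do not produce a different DNS name.
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped == ch or flipped not in ALLOWED:
                continue
            candidate = label[:i] + flipped + label[i + 1:]
            # A label may not start or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            candidates.add(candidate)
    return sorted(candidates)


if __name__ == "__main__":
    print(bitsquat_candidates("cnn"))
```

For example, `n` (0x6E) and `o` (0x6F) differ only in the lowest bit, so `con` shows up among the candidates for `cnn`.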

johnklos, about 2 years ago
ECC is good, and I genuinely wish it were more common. Thankfully, Ryzen CPUs support ECC by default (except for pre-7000 series with integrated graphics that aren't "Pro" versions), so long as the motherboard does, too (like all ASRock boards that I've seen). I'm running several Ryzen servers with ECC.

On the other hand, there are many, many systems out there that don't have ECC, nor do they have the option to have ECC. While every video on YouTube wants us to believe that the difference between 580 and 585 frames per second in some silly game or another makes all the difference in the world, for me the difference between a system that runs 10% slower and one that crashes in the middle of the night is actually significant. I test all my systems at a certain memory frequency, then back off to the next slower frequency just to be sure.

That doesn't stop memory errors from happening, but most of my systems have lived their entire lives without random crashes or random segfaults. I consider that worthwhile.

tasty_freeze, about 2 years ago
A bit over 20 years ago I had a PC with a memory stick that had gone bad, but not bad enough that it was crashing all the time ... it crashed often enough running Windows 98 apps that I attributed all crashes to software nonsense.

Back then it was recommended to run a defragger every so often, so I set up a cron job to run it every Saturday night or something like that. The net result was that every file block that got moved made a trip through memory with some small probability of getting corrupted. Often the errors were in files that weren't used that often, so I didn't immediately notice. After many months of this, I started noticing PDF files that were corrupted, or mp3 files that would hiccup in the middle even though they used to play perfectly before. Sadly, I had ripped my 500-ish CD collection and then had gotten rid of the physical CDs.

ilyt, about 2 years ago
That reminds me of how I accidentally tracked a memory issue to a failing power supply.

I noticed (after some Windows bluescreens) in memtest that the memory was showing some errors. Ordered another 16GB pair, replaced it and.... the problem persisted.

Suspecting the motherboard, I just chalked it up to the mobo and pretty much said "well, I'm not replacing the mobo now, it will have to wait for the next hardware refresh." Gaming PC, so no big deal. And now I had 32 GB of RAM in the PC.

Weirdly enough, the problem only happened when running the multi-core memory test.

Cue ~1 year later and my power supply just... died. Guessing bad caps, I just ordered another and thought nothing of it. On a whim I ran memtest and....

nothing. All fixed. Repeated it a few times and it was just fine; no bluescreens for ~2 years now, too.

I definitely want to get my next machine with ECC, but the DDR4 consumer ECC situation looks... weird. I'm not sure whether I should be happy with on-chip ECC; I'd really prefer to have the whole CPU-memory pipe ECCed.

mnw21cam, about 2 years ago
Two things. Firstly, I don't think any conclusions can be drawn about whether dd or dd-rescue is more susceptible to bit flips. It could be that both allocated a buffer, and dd-rescue just happened to be handed the area of memory with the fault in it, which it reused multiple times, whereas when dd was run that area of memory was used by something else. Memory mapping and usage in a real operating system is highly non-deterministic due to the sheer number of things that affect it.

Secondly, once a good list of known faulty memory addresses has been created by memtest, one can tell the operating system not to use them. Then you can keep using your old hardware without the reliability problems. However, it is possible that further areas of memory will subsequently fail, and without ECC you'll still be vulnerable to random (cosmic ray-induced) bit flips.
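
One way to tell Linux to avoid known-bad memory is the `memmap=nn$ss` kernel boot parameter, which marks a physical region as reserved. The sketch below turns a list of faulty physical addresses into such parameters. It assumes a 4 KiB page size and uses made-up example addresses; the helper names are hypothetical, and in practice the `$` often needs escaping in bootloader configuration files.

```python
PAGE_SIZE = 4096  # assumed page size; typical on x86-64 Linux


def bad_pages_to_ranges(bad_addresses, page_size=PAGE_SIZE):
    """Merge faulty physical addresses into sorted, page-aligned
    (start, length) ranges, joining pages that are adjacent."""
    pages = sorted({addr // page_size for addr in bad_addresses})
    ranges = []
    for page in pages:
        start = page * page_size
        if ranges and ranges[-1][0] + ranges[-1][1] == start:
            # This page directly follows the previous range: extend it.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + page_size)
        else:
            ranges.append((start, page_size))
    return ranges


def memmap_params(bad_addresses, page_size=PAGE_SIZE):
    """Format each range as a Linux `memmap=<size>$<start>` parameter."""
    return [
        f"memmap={length:#x}${start:#x}"
        for start, length in bad_pages_to_ranges(bad_addresses, page_size)
    ]


if __name__ == "__main__":
    # Hypothetical addresses, standing in for a real memtest report.
    for param in memmap_params([0x1000, 0x1FFF, 0x2000, 0x5000]):
        print(param)
```

Reserving whole pages rather than single addresses is a deliberate simplification: the kernel manages memory in pages, so one bad bit costs one page either way.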

latchkey, about 2 years ago
I ran a cluster of ~30k blade-based computers booting entirely off iPXE. They didn't have any onboard SSD/disk storage or ECC memory. Every day, a few of them would randomly lock up; they'd reboot with a fresh network image and keep on humming.

ta988, about 2 years ago
I've had a lot of really strange bugs and data loss with my current build (Ryzen with G.Skill memory). After running memtest for 24h I finally saw that two of the four RAM sticks were faulty (two bit flips on each, only rarely and on a specific test). The company replaced them, but now, after a year without any issues, I have another one that failed in exactly the same way. This is the last time I build a non-ECC system for myself.

mtlmtlmtlmtl, about 2 years ago
Amazing technical write-up. But if there's no cause for alarm based on SMART, I would just do the memtest right then, because that's always my go-to for weird undiagnosed problems. I find it's usually not the problem, although when it has been I've ended up wasting a silly amount of time on it (just like this case!).

And if there was cause for alarm, I would think long and hard about imaging from the original computer at all. With certain failure modes in drives, just reading could cause more corruption; each failed attempt could lose data.

But yeah, happy you did it this way in the end, because I learned a ton from the resulting blog post!

muro, about 2 years ago
AFAICT, no current Mac comes with ECC. Do they have the same issues? If so, one doesn't hear about them too often.

T3OU-736, about 2 years ago
https://youtu.be/aPd8MaCyw5E ("ShmooCon 2014: You Don't Have The Evidence - Forensic Imaging Tools") was quite an eye-opening talk about common tools, like the article-mentioned `dd` (and its cousin `ddrescue`), and how they deal with I/O errors.

To be clear, I do not believe that the tools are at fault. Rather, the SATA/SAS/IDE controllers have a different design goal, and software tools can only do so much.

Tools like DeepSpar (HW+SW) and PC-3000 (also HW+SW) allow for a scary level of nitty-gritty access to HW (including flashing SSD/HDD controller FW in case it went pear-shaped). For data recovery, be it in a forensic context or in the context of retrieving important irreplaceable data, I have always had a nerd-lust for those tools. Used them at a previous job, but can't ever justify the price for personal and very infrequent use. :)

undersuit, about 2 years ago
> Does increased heat increase the likelihood of memory errors? I think it does.

I just got through a round of overclocking my memory. Yes, heat does.

> tRFC is the number of cycles for which the DRAM capacitors are "recharged" or refreshed. Because capacitor charge loss is proportional to temperature, RAM operating at higher temperatures may need substantially higher tRFC values.

https://github.com/integralfx/MemTestHelper/blob/oc-guide/DDR4%20OC%20Guide.md#temperatures-and-its-effect-on-stability

WirelessGigabit, about 2 years ago
This reminds me of a bug in Google Chrome that was attributed to a flipped bit.

If anyone has the link, it's missing from my collection...

1letterunixname, about 2 years ago
This wasn't run with a large enough sample size to be statistically valid.

moremetadata, about 2 years ago
Moral of the story?

Upgrade to DDR5 RAM, the latest standard, which has on-die ECC but is not as good at spotting bit flips as proper ECC memory with a separate extra data correction chip.

https://en.wikipedia.org/wiki/DDR5_SDRAM#:~:text=Unlike%20DDR4%2C%20all%20DDR5%20chips,sending%20data%20to%20the%20CPU.

Whilst proper ECC RAM chips and motherboards exist, I'm surprised that a cheaper solution that is just as good as proper ECC doesn't exist, although I know some would argue that DDR5 is a step in the right direction in a marathon.

I guess the markets know best and chase the numbers, assuming they are also using proper ECC memory and binary-coded decimal (something central banks have been using for decades) rather than floating-point arithmetic, which introduces errors?

https://en.wikipedia.org/wiki/Floating-point_error_mitigation
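
The floating-point point above is about rounding in the number representation, which is a separate issue from memory bit flips. As a small illustration, the Python sketch below sums the same amounts as binary floating point and as base-10 decimals. Python's `decimal` module is used here as a stand-in for decimal arithmetic in general; it is not literally binary-coded decimal, and the helper names and amounts are made up for the example.

```python
from decimal import Decimal


def float_total(amounts):
    """Sum amounts as binary floating point (IEEE 754 doubles)."""
    return sum(float(a) for a in amounts)


def decimal_total(amounts):
    """Sum amounts exactly as base-10 decimals."""
    return sum((Decimal(a) for a in amounts), Decimal("0"))


if __name__ == "__main__":
    amounts = ["0.10", "0.20"]  # hypothetical amounts
    # 0.1 and 0.2 have no exact binary representation, so the
    # float sum is not exactly 0.3.
    print(float_total(amounts) == 0.3)                 # False
    print(decimal_total(amounts) == Decimal("0.30"))   # True
```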

rrdharan, about 2 years ago
> To even detect this, I needed the patience and discipline to verify the checksum on a 500GB file! Imagine how much more time I could have wasted if I didn't bother to verify the checksum and made use of an important business document that contained one of the 14 bit flips?

Unpopular-opinion counterpoint: the odds of this actually happening are vanishingly small. Many file formats have built-in integrity checks and tons of redundancy and waste. I wouldn't want to risk handling extremely valuable private keys or conducting high-value cryptocurrency transactions or something, I suppose, on a machine without ECC memory, but that just doesn't really come up in most knowledge-worker or end-consumer scenarios.

The odds of actually getting bitten by this in a way that matters to you are really low, which is why nobody cares.
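
For reference, verifying a checksum on a large image does not require holding the file in memory. The Python sketch below hashes a file in chunks and, given two copies of the same image, counts how many bits differ between them. The function names and the 1 MiB chunk size are assumptions for illustration, not the article author's method. Note that on a machine with faulty RAM the check itself passes through the same memory, so it is best run more than once or on different hardware.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice


def sha256_of_file(path, chunk_size=CHUNK_SIZE):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def count_flipped_bits(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Count the bits that differ between two files of equal length."""
    flipped = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if not chunk_a and not chunk_b:
                break
            if len(chunk_a) != len(chunk_b):
                raise ValueError("files differ in length")
            if chunk_a != chunk_b:
                # XOR the chunks as big integers; set bits mark differences.
                diff = int.from_bytes(chunk_a, "big") ^ int.from_bytes(chunk_b, "big")
                flipped += bin(diff).count("1")
    return flipped
```

A mismatch in `sha256_of_file` says only that something differs; `count_flipped_bits` against a second, independently taken image gives a rough sense of how much.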