TechEcho
ECC matters

1053 points by rajesh-s over 4 years ago

61 comments

nostrademons over 4 years ago
I still remember Craig Silverstein being asked what his biggest mistake at Google was and him answering "Not pushing for ECC memory."
Google's initial strategy (c. 2000) around this was to save a few bucks on hardware, get non-ECC memory, and then compensate for it in software. It turns out this is a terrible idea, because if you can't count on memory being robust against cosmic rays, you also can't count on the software being stored in that memory being robust against cosmic rays. And when you have thousands of machines with petabytes of RAM, those bitflips do happen. Google wasted many man-years tracking down corrupted GFS files and index shards before they finally bit the bullet and just paid for ECC.
dijit over 4 years ago
I beg this, every time this conversation comes up it's the same answer: "I don't see a problem".
It's so easy to chalk these kinds of errors up to other issues: a little corruption here, a running program goes berserk there. Could be a buggy program or a little accidental memory overwrite. Reboot will fix it.
But I ran many thousands of physical machines, petabytes of RAM, I tracked memory flip errors and they were _common_; common even in less dense memory, in thick metal enclosures surrounded by mesh, where density and shielding impact bitflips a lot.
My own experience tracking bitflips across my fleet led me to buy a Xeon laptop with ECC memory (Precision 5520) and it has (anecdotally) been significantly more reliable than my desktop.
cbanek over 4 years ago
As someone who has had to read thousands of random game crash reports from all over the interwebs (you know when Windows says you might want to send that crash log? like that), I totally agree.
Of all the things to be worried about, like OS bugs, bad hardware configuration, etc., bad memory is one of those really troubling things. You look at the code and say "it can't make it here, because this was set" but when you can't trust your memory you can't trust anything.
And as the timeline goes to infinity, you may also get one of these reports and be asked to fix it... good luck.
zdw over 4 years ago
Good news is that for DDR5, ECC is a required part of the spec and should be a feature of every module:
https://www.anandtech.com/show/15912/ddr5-specification-released-setting-the-stage-for-ddr56400-and-beyond
simias over 4 years ago
I used to be pretty skeptical of ECC for consumer-grade hardware, mainly because I felt that I'd always prefer cheaper/more RAM over ECC RAM even if it meant that I'd get a couple of crashes every year due to rogue bitflips. For servers it's a different story, but for a desktop I'm fine dealing with some instability for better performance.
But these days, with RAM density being so high and bitflipping attacks being more than a theoretical threat, it seems like there's really no good reason not to switch to ECC everywhere.
otterley over 4 years ago
About 1/3 of Google's machines and 8% of Google's DIMMs in their fleet suffer at least one correctible memory error per year: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
petermcneeley over 4 years ago
I would also add that Row Hammer attacks are much harder on ECC.
When I first tried to replicate the row hammer attack I was not getting any results. Turns out I was doing this on ECC. On non-ECC memory the same test easily replicated the row hammer attack.
https://en.wikipedia.org/wiki/Row_hammer
kensai over 4 years ago
“ECC availability matters a lot - exactly because Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation.”
Its.
There, I finally corrected Linus Torvalds in something. :))
MarkusWandel over 4 years ago
This is one justified Linus rant! My personal history includes data loss twice because of defective RAM, and many more RAMs discarded after the now obligatory overnight run of MemTest86+ (these were all secondhand RAMs - I would never buy a new one without a refund guarantee). My very first "PC" still had the ECC capability and I used it. My own now very dated rant on the subject: http://wandel.ca/homepage/memory_rant.html
otterley over 4 years ago
D. J. Bernstein (of qmail/daemontools fame) spoke of it over a decade ago as well. https://cr.yp.to/hardware/ecc.html
linsomniac over 4 years ago
This reminds me of last year, when we ordered a new $14K server. It arrived and we ran it through our burn-in process, which included running memtest86 on it, and it would, after around 7 hours, generate errors.
Support was only interested if their built-in memory tester, which even on its most thorough setting would only run for ~3 hours, would show errors, which it wouldn't. IIRC, the BMC was logging "correctable memory errors", but I may be misremembering that.
"We've run this test on every server we've gotten from you, including several others that were exactly the same config as this, this is the only one that's ever thrown errors". Usually support is really great, but they really didn't care in this case.
We finally contacted sales. "Uh, how long do we have to return this server for a refund?" All of a sudden support was willing to ship us out a replacement memory module (memtest86 identified which slot was having the problem), which resolved the problem.
They were all too willing to have us go to production relying on ECC to handle the memory error.
dboreham over 4 years ago
You don't need to look at kernel crashes to speculate about bus and memory errors -- just check the logs on a few systems that do have ECC. Pretty soon you'll see correctable errors being reported.
JoeAltmaier over 4 years ago
ECC works if done right. Accessing a memory location can fix bit-flips (ECC is a 'correcting' code). But systems that don't regularly visit every memory location can accumulate risk. Those dark corners of RAM can eventually get double-bit errors and be uncorrectable. So an OS might 'wash' RAM during idle moments, reading every location in a round-robin manner to get ECC to kick in and auto-correct. Doesn't matter how fast (1M every hour or whatever) as long as somehow ECC has a chance to work.
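To make the point about scrubbing concrete, here is a minimal sketch of a single-error-correcting code. It uses a toy Hamming(7,4) code rather than the wider code a real ECC DIMM uses, and the function names (`encode`, `syndrome`, `scrub`, `decode`) are made up for illustration. It shows that one flipped bit is repaired on access, while two accumulated flips are beyond what this code can fix, which is the risk the comment describes for memory that is never visited.

```python
def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused so indices match positions
    code[3], code[5], code[6], code[7] = d     # data bits live at non-power-of-two positions
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions with bit 2 set
    return code[1:]


def syndrome(word):
    """XOR of the positions of all set bits; 0 means the word looks valid."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def scrub(word):
    """Return a copy with the bit named by the syndrome flipped back."""
    fixed = list(word)
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1
    return fixed


def decode(word):
    """Extract the 4 data bits from positions 3, 5, 6 and 7."""
    return word[2] | (word[4] << 1) | (word[5] << 2) | (word[6] << 3)


if __name__ == "__main__":
    stored = encode(0b1011)

    one_flip = list(stored)
    one_flip[4] ^= 1                      # a single upset: scrubbing repairs it
    print(decode(scrub(one_flip)) == 0b1011)

    two_flips = list(stored)
    two_flips[0] ^= 1
    two_flips[6] ^= 1                     # a second upset before anyone looked
    print(decode(scrub(two_flips)) == 0b1011)
```

In this toy code a double error is silently miscorrected. SECDED codes add one more parity bit so that a double error is at least detected and reported instead, but either way the data is gone, so the goal of scrubbing is to visit each word while it still has at most one bad bit.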
jkuria over 4 years ago
For those, like me, wondering what ECC is, here's an explanation:
https://www.tomshardware.com/reviews/ecc-memory-ram-glossary-definition,6013.html
KingMachiavelli over 4 years ago
Is there such a thing as 'software' ECC where a segment in memory also has a checksum stored in memory and the CPU just verifies it when the memory segment is accessed?
It would be a lot slower than real ECC but it could just be used for operations that would be especially vulnerable to bit flips. It would also not know for certain whether the memory segment holding the data or the one holding the checksum was corrupted, beyond their relative sizes (the checksum is much smaller, so a bit flip in its memory region is less likely).
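A rough sketch of what the question describes, assuming a hypothetical `CheckedBuffer` wrapper that stores a CRC-32 next to the data and re-verifies it on every read. This only detects corruption, it cannot correct it, and as the comment notes it cannot tell whether the data or the stored checksum took the hit.

```python
import zlib


class CorruptionError(Exception):
    """Raised when the stored checksum no longer matches the data."""


class CheckedBuffer:
    """Hypothetical wrapper: data plus a CRC-32 that is verified on access."""

    def __init__(self, data: bytes):
        self._data = bytearray(data)
        self._crc = zlib.crc32(self._data)

    def write(self, data: bytes) -> None:
        self._data = bytearray(data)
        self._crc = zlib.crc32(self._data)

    def read(self) -> bytes:
        # Recompute on every access; this is the cost that makes it slower than hardware ECC.
        if zlib.crc32(self._data) != self._crc:
            raise CorruptionError("checksum mismatch: data or checksum was corrupted")
        return bytes(self._data)


if __name__ == "__main__":
    buf = CheckedBuffer(b"account balance: 1000")
    print(buf.read())

    buf._data[17] ^= 0x01      # simulate a single bit flip in the data region
    try:
        buf.read()
    except CorruptionError as exc:
        print("detected:", exc)
```

Keeping two independent copies plus a checksum would allow recovery as well as detection, at the price of more memory and more work per access.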
freeqaz over 4 years ago
I bought ECC RAM for my laptop and it definitely was about 4x the price. It's valuable to me for a few reasons -- peace of mind being a big one.
Bit flips happen and are real. I really wish ECC was plentiful and not brutally expensive!
phh over 4 years ago
I don't know if ECC is that important, but reliability of RAM (or any storage) feels pretty crazy to me. 128GB being refreshed every second for a month without error requires that the per-bit refresh process has a reliability of 99.9999999999999999% to be flawless. Considering we are dealing with quantum effects (which are inherently probabilistic), I wouldn't trust myself to design anything like that.
Now back to ECC, I'll probably be corrected, but I don't think ECC helps gain more than two orders of magnitude, so we still need incredibly reliable RAM. If we move to ECC RAM by default everywhere, aren't we simply going to get less reliable RAM at the end?
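The back-of-the-envelope figure above can be reproduced in a few lines. This is only a sketch of the arithmetic under the comment's own simplification (one refresh per bit per second), and it additionally assumes 128 GiB and a 30-day month; `budget` is the per-operation failure probability at which you would expect about one error over the whole month.

```python
from decimal import Decimal, getcontext

getcontext().prec = 40  # enough digits to show a long run of nines

bits = 128 * 2**30 * 8            # assumption: "128GB" taken as 128 GiB
refreshes = 30 * 24 * 3600        # assumption: 30-day month, one refresh per second
operations = bits * refreshes     # total per-bit refresh operations

budget = Decimal(1) / Decimal(operations)   # allowed failure probability per operation
reliability_pct = (Decimal(1) - budget) * 100

print(f"bit-refresh operations per month: {operations:.3e}")
print(f"allowed failure probability:      {budget:.3e}")
print(f"required per-operation reliability: {reliability_pct}%")
```

The total comes to roughly 2.8e18 operations, so the failure budget per operation is a few times 1e-19, which is the same order of magnitude as the figure quoted in the comment.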
johnklos over 4 years ago
From the fortune database:
As far as we know, our computer has never had an undetected error. -- Weisert
londons_explore over 4 years ago
I simply care that my computer executes code perfectly. Let's settle on "one instance of unintended behaviour per hundred years" for that metric.
If it needs ECC memory to do that, then fit it with ECC memory. If there are other ways to achieve that (for example deeper DRAM cells to be more robust to cosmic rays) that's fine too.
Just meet the reliability spec - I don't care how.
paulie_a over 4 years ago
There was a great Defcon talk a while back regarding using ECC. The concept was called "dns jitter".
Basically you can register domains using small bit differences for domains and start getting email and such for that domain.
If I recall correctly the example given was a variation of microsoft.com.
All because so much equipment doesn't use ECC.
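The technique described here is often called bitsquatting. As an illustration of how such lookalike names arise, the sketch below enumerates every name that differs from a given domain by exactly one flipped bit and is still a plausible hostname. The function name `bit_flip_variants` and the filtering rules are assumptions for this example; it does not check whether the resulting top-level domain exists.

```python
import string

# Characters allowed in a hostname label (lowercase only, since DNS is case-insensitive).
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bit_flip_variants(domain):
    """Return all names one bit flip away from `domain` that still look like hostnames."""
    variants = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue  # leave label separators alone
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped not in ALLOWED:
                continue
            candidate = domain[:i] + flipped + domain[i + 1:]
            labels = candidate.split(".")
            # A label may not start or end with a hyphen.
            if all(not l.startswith("-") and not l.endswith("-") for l in labels):
                variants.add(candidate)
    return sorted(variants)


if __name__ == "__main__":
    for name in bit_flip_variants("microsoft.com"):
        print(name)
```

For example, `r` is 0x72 and `2` is 0x32, so a flip of bit 6 in a machine's memory turns `microsoft.com` into `mic2osoft.com`, and the request goes to whoever registered that name.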
MAXPOOL over 4 years ago
Well shit.
I run some large ML models on my home PC and I get NaNs and some out-of-range floats every month or so. I have spent hours debugging, but doing the same computation with the same random seeds does not recreate the problem.
How about GPUs and their GDDR SDRAM? Do they have parity bits?
spacedcowboy over 4 years ago
Seems likely that “bad ram” was the reason for the recent AT&T fiber issues, given that 1 bit was being flipped reliably in data packets [1]
[1]: https://twitter.com/catfish_man/status/1335373029245775872?lang=en
louwrentius over 4 years ago
ECC matters, even on the desktop, it's not even a discussion, to me.
If you think it doesn't matter: how do you know? If you don't run with ECC memory, you'll never know if memory was corrupted (and recovered).
That blue screen, that sudden reboot, that program crashing. That corrupted picture of your kid.
Who knows.
I'll tell you, who knows. God damn every sysadmin (or the modern equivalent) can tell you how often they get ECC errors. And at even a small scale you'll encounter them. I have, on servers and even on a SAN storage controller, for crying out loud.
If you care about your data, use ECC memory in your computers.
knorker over 4 years ago
I have multiple times postponed buying new computers for YEARS, because I'm waiting for intel to get their head out of their ass and actually let me buy something that does ECC for desktop. (incl laptops)
I would have bought computers when I "wanted one". Now I buy them when I _need_ one. Because buying a non-ECC computer just feels like buying a defective product.
In the last 10 years I would have bought TWICE as many computers if they hadn't segmented their market.
Fuck intel. I sense that Linus self-censored himself in this post, and like me is even angrier than the text implies.
1996 over 4 years ago
Linus is absolutely right.
I am trying to get a laptop with dual NVMe (for ZFS) and ECC RAM. I can't get that, at all - even without the other fancy things I would like such as a 4k OLED with pen/touchscreen.
In 2020, even the Dell XPS stopped shipping OLED (goodbye dear 7390!)
I will gladly give my money to anyone who sells an AMD laptop with ECC. Hopefully, it will show there's demand for "high end yet non bulky laptops"
IgorPartola over 4 years ago
I wish this was more of a cohesive argument. He says he thinks it's important and points to row-hammer problems but doesn't explain why. Probably because the audience it was written for already knows the arguments of why, but this is not the best argument.
If in doubt, get ECC. Do your own research on how it works and why. This post won't explain it, just will blame Intel (probably rightfully so).
tgbugs over 4 years ago
A relevant Bryan Cantrill talk segment on this, which heightens the paranoia around this. Namely, firmware hiding correctable errors and only reporting uncorrectable errors.
https://www.youtube.com/watch?t=2104&v=fE2KDzZaxvE
type0 over 4 years ago
Consumer awareness about ECC needs to be better, with recent security implications I simply can't understand why more motherboard manufacturers don't support it on AMD. Intel of course is all to blame on the blue side, I stopped buying their overpriced Xeons because of this.
kozak over 4 years ago
I'm about to write some code that will allocate a random buffer, take a checksum of it, and just sit on the buffer, periodically checksumming it again until a bit flips. Or maybe even allocate a buffer of zeros and wait until a non-zero appears in it.
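A minimal sketch of the experiment described above, assuming a hypothetical `watch_buffer` helper built on the standard library. It fills a buffer with random bytes, records a SHA-256 digest, and re-hashes at a fixed interval until the digest changes or a check limit is reached. The sizes and intervals are placeholders, and nothing here pins the buffer in physical RAM or defeats CPU caches, so treat it as a starting point rather than a measurement tool.

```python
import hashlib
import os
import time


def watch_buffer(size_bytes, interval_s, max_checks):
    """Return the check number at which the buffer changed, or None if it never did."""
    buf = bytearray(os.urandom(size_bytes))
    baseline = hashlib.sha256(buf).digest()
    for check in range(1, max_checks + 1):
        time.sleep(interval_s)
        if hashlib.sha256(buf).digest() != baseline:
            return check
    return None


def first_nonzero(buf):
    """For the all-zeros variant: index of the first non-zero byte, or None."""
    for i, byte in enumerate(buf):
        if byte:
            return i
    return None


if __name__ == "__main__":
    # Placeholder parameters: 256 MiB, checked once a minute for a day.
    result = watch_buffer(256 * 2**20, 60, 24 * 60)
    print("no change observed" if result is None else f"buffer changed at check {result}")
```

The all-zeros variant has one advantage: when a byte does change, `first_nonzero` tells you where and which bit, whereas a digest mismatch only tells you that something changed.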
FartyMcFarter over 4 years ago
Does anyone know why ECC memory requires the CPU to support it?
Naively, I can understand why error _reporting_ has dependencies on other parts of the system, but it would seem possible for error _correction_ to work transparently.
vlovich123 over 4 years ago
A couple of years ago there were advances that claimed to make Rowhammer work on ECC RAM even with DDR4 [1]. Is that no longer a concern for some reason?
I would think the only guaranteed solutions to Rowhammer are actually cryptographic digests and/or guard pages.
[1] https://www.zdnet.com/article/rowhammer-attacks-can-now-bypass-ecc-memory-protections/
amelius over 4 years ago
Does Apple use ECC in its M1 laptop?
greyhair over 4 years ago
ECC is required on mission critical hardware.
I have spent 36 years fielding embedded devices in core network (D1/E1, SONET, ROADM/MPLS, Cellular basestation) and I will tell you that large ECC covered memory arrays always show small numbers of correctable error events over the course of a year. I have seen, over the course of my career, exactly one controller card replaced early in the field, because it started throwing excessive recoverable ECC events over time, until it hit a threshold of 10x the average of a typical board. On the order of ten recoverable ECC events per month instead of one event per month. I have never observed a logged non-correctable ECC event in the field. In the lab, yes, but never in fielded equipment.
If you are fine with your PC experiencing one or two bits flipped in memory every month, then you really don't need ECC. That is the question you need to answer.
For mission critical systems? ECC is a requirement.
willis936 over 4 years ago
Whenever this topic comes up I wonder how much more resilient are CPU registers compared to DRAM.
MisterTea over 4 years ago
> ECC availability matters a lot - exactly because Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation.
The phrase that strikes me is "horribly bad market segmentation". I agree 100%.
Remember when the Pentium/pro/2/3 could operate in single and dual socket configurations with ECC? The same CPU that plugged into your low end consumer board could also plug into a high end server/workstation board. All you needed was the right motherboard.
_0ffh over 4 years ago
Please someone correct me if I'm wrong, but as far as I can remember memory with extra capacity for error detection used to be a rather common thing on early PCs. That really only changed a couple of decades in, in order to be able to offer lower prices to home users who didn't know or care about the difference. Probably about the time, or earlier, when with some hard disk manufacturers megabytes suddenly shrunk to 10^6 bytes (before kibibytes or mebibytes were a thing, btw).
wicket over 4 years ago
Over the years, I don't think I've ever been able to explain to anyone that their memory error could have been caused by a cosmic ray without being laughed at.
mauri870 over 4 years ago
In case the page is not loading, refer to the Wayback Machine [1] for a copy.
[1] https://web.archive.org/web/*/https://www.realworldtech.com/forum/?threadid=198497&curpostid=198647
jhoechtl over 4 years ago
I definitely do not want Linus Torvalds yelling at me in that tone --- but reading his utterings is certainly entertaining.
aborsy over 4 years ago
For the average user, what's the impact of bit flips in memory in practical terms?
I am not talking about servers dealing with critical data.
Suppose that I maintain a repository (documents, audio and video), one copy in a ZFS-ECC system and one in an ext4-nonECC system.
Would I notice a difference between these two copies after 5-10 years?
That tells us if ECC matters for most people.
arendtio over 4 years ago
It would be interesting to see how many more kernel oops appear on machines without ECC compared to those with ECC.
indolering over 4 years ago
My favorite example is a bit flip altering election results:
https://www.wnycstudios.org/podcasts/radiolab/articles/bit-flip
trissylegs over 4 years ago
When I chose my PC parts when Ryzen first came out I tried to get ECC parts. The RAM was obtainable, the problem was that no motherboards had ECC support at the time. I hope the situation has improved by the time I get my next motherboard/CPU upgrade.
elgfare over 4 years ago
For those out of the loop like me, ECC does indeed stand for error correcting code. https://en.m.wikipedia.org/wiki/ECC_memory
nix23 over 4 years ago
I always have that conversation when ZFS comes up. Some people think ZFS NEEDS ECC, but in fact ZFS needs ECC just as much as every other FS on Linux. And every single reliable machine needs ECC.
ratiolat over 4 years ago
I have:
Asus PRIME A520M-K motherboard
2x M391A2K43DB1-CVF (Samsung 16GiB ECC unbuffered RAM)
AMD Ryzen 5 3600
I specifically was looking for bang for buck, low(er) wattage and ECC.
unixhero over 4 years ago
Fantastic burn by Linus Torvalds, who also had some skin in the CPU game.
Offtopic, I wonder if he trawls that site regularly. And eventually I wonder, is he here also? :)
Noxmiles over 4 years ago
I was reading it and thought: wow, this guy is absolutely right! Great things he's talking about. After reading it, I saw it was Linus Torvalds :D
raghavtoshniwal over 4 years ago
Once trained a GPT2 model to do text-gen on Linus' emails. Boy, there were some choice angry rants and nonsensical technical jargon that was generated.
z3t4 over 4 years ago
Memory often comes with lifetime guarantees. If they had ECC it would be much easier to detect bad memory...
JumpCrisscross over 4 years ago
What is the status of ECC on Macs?
qwerty456127 over 4 years ago
ECC should be everywhere. It seems outrageous to me that almost no laptops have ECC.
belzebalex over 4 years ago
Asked myself, would it be possible to build a Geiger counter with RAM?
rafaelturk over 4 years ago
Little bit offtopic: Again seems that Intel? what?! is the one lowering the bar.
b0rsuk over 4 years ago
I browsed some online listings for ECC memory modules, and they seem to be sold one module at a time. Standard DDR4 modules are sold in pairs, to benefit from dual channel mode.
Does ECC memory support dual channel??
srtjstjsj over 4 years ago
I guess Linus's recent project to communicate more respectfully didn't pan out.
musingsole over 4 years ago
It's a shame we don't have ECC for individuals. How many of society's bugs come from someone wandering around with a bit flipped?
rahimiali over 4 years ago
I have trouble parsing information from this rant. Is someone willing to translate this into an argument (a string of facts tied by logical steps)?
wagslane over 4 years ago
It really does. I did a write-up recently on it as I was diving in and understanding the benefits: https://qvault.io/2020/09/17/very-basic-intro-to-elliptic-curve-cryptography/
sally1620 over 4 years ago
Linus is accusing Intel of killing ECC intentionally. But that is not really the case, they just wanted people to pay up.
If you care about ECC, you pay for Xeon. The majority of consumers don't run critical applications on their devices, so they are happy with a cheap device that may crash once in a while.
AMD is only changing the game because they are trying to undercut Intel. They have been putting pro features into all of their CPUs including over-clocking, extra PCIe lanes and ECC.
Honestly, what is the point of bullet-proof hardware when the software reliability (at least on consumer devices) has gone down to two nines?
sys_64738 over 4 years ago
ECC memory is predominantly used in servers, where failure absolutely must be identified and logged. It is used to a lesser extent in the desktop market, since fewer mission-critical tasks are run there.