9 comments

viraptor, over 2 years ago

I wonder how well that paper holds up over a decade later. It reviewed DDR1/2 in 2009. I like to ask people running ECC to check their error counters. (on Linux `edac-util -rfull`) From my very non-scientific survey, memory errors seem to happen significantly less often than this paper would lead you to believe. Then again, running ECC in the first place indicates better hardware than non-ECC, so that's a likely bias.

erik, over 2 years ago

Has anyone tried using software to measure bit-flip rates on non-ECC systems? It seems like a pretty easy task. Turn off swap. Fill a bunch of memory with a known pattern. Every few hours read all the memory and verify that no bits were flipped. If the 2009 result holds on modern systems and a gigabyte of DRAM flips a bit every few hours, then evidence should show up pretty quickly.
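The procedure described above could be sketched roughly as follows. This is only an illustration of the fill-and-verify logic: the pattern value and the names `fill`, `find_flips`, and `monitor` are made up for this example, and a real test would also have to make sure the buffer stays resident in physical RAM (no swap) and that reads are not simply served from CPU caches.

```python
import time

PATTERN = 0xAA  # assumed test pattern (binary 10101010); any known byte works


def fill(size_bytes, pattern=PATTERN):
    """Allocate a buffer and fill every byte with the known pattern."""
    return bytearray([pattern]) * size_bytes


def find_flips(buf, pattern=PATTERN):
    """Return (byte_offset, bit_index) for every bit that differs from the pattern."""
    if buf.count(pattern) == len(buf):  # fast path: nothing changed
        return []
    flips = []
    for offset, value in enumerate(buf):
        diff = value ^ pattern
        for bit in range(8):
            if (diff >> bit) & 1:
                flips.append((offset, bit))
    return flips


def monitor(buf, interval_s, rounds, pattern=PATTERN):
    """Re-check the buffer every interval_s seconds, for a fixed number of rounds."""
    for _ in range(rounds):
        time.sleep(interval_s)
        for offset, bit in find_flips(buf, pattern):
            print(f"bit {bit} flipped at byte offset {offset}")
            buf[offset] = pattern  # restore so the same flip is not reported twice
```

A run might look like `monitor(fill(1 << 30), interval_s=3 * 3600, rounds=8)`, i.e. one gigabyte checked every three hours.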

bugfix-66, over 2 years ago

https://pqsrc.cr.yp.to/libsecded-20220828/INTERNALS.html

libsecded encodes an n-byte array using an extended Hamming code on the bottom bit of each byte, in parallel an extended Hamming code on the next bit of each byte, etc.

https://en.m.wikipedia.org/wiki/Hamming_code

Extended Hamming codes achieve a Hamming distance of four, which allows the decoder to distinguish between when at most one one-bit error occurs and when any two-bit errors occur. In this sense, extended Hamming codes are single-error correcting and double-error detecting, abbreviated as SECDED.

The main idea is to choose the error-correcting bits such that the index-XOR (the XOR of all the bit positions containing a 1) is 0. We use positions 1, 10, 100, etc. (in binary) as the error-correcting bits, which guarantees it is possible to set the error-correcting bits so that the index-XOR of the whole message is 0. If the receiver receives a string with index-XOR 0, they can conclude there were no corruptions, and otherwise, the index-XOR indicates the index of the corrupted bit.

Hamming codes have a minimum distance of 3, which means that the decoder can detect and correct a single error, but it cannot distinguish a double bit error of some codeword from a single bit error of a different codeword. Thus, some double-bit errors will be incorrectly decoded as if they were single bit errors and therefore go undetected, unless no correction is attempted.

To remedy this shortcoming, Hamming codes can be extended by an extra parity bit. This way, it is possible to increase the minimum distance of the Hamming code to 4, which allows the decoder to distinguish between single bit errors and two-bit errors. Thus the decoder can detect and correct a single error and at the same time detect (but not correct) a double error.
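To make the index-XOR idea concrete, here is a minimal sketch of an extended Hamming (SECDED) code on a flat list of bits. It is not libsecded's actual layout or API (libsecded runs one such code per bit position across the bytes of an array); the function names and the choice of a 16-bit codeword are assumptions for illustration. Position 0 holds the extra overall parity bit, positions 1, 2, 4, 8 hold the Hamming check bits, and the rest hold data.

```python
from functools import reduce
from operator import xor


def index_xor(bits):
    """XOR of all positions that contain a 1."""
    return reduce(xor, (i for i, b in enumerate(bits) if b), 0)


def data_positions(n):
    """Positions 1..n-1 that are not powers of two (these carry data)."""
    return [i for i in range(1, n) if i & (i - 1)]


def encode(data, r=4):
    """Encode data bits into a 2**r-bit extended Hamming codeword."""
    n = 1 << r
    positions = data_positions(n)
    if len(data) != len(positions):
        raise ValueError(f"expected {len(positions)} data bits")
    code = [0] * n
    for pos, bit in zip(positions, data):
        code[pos] = bit
    syndrome = index_xor(code)
    for k in range(r):              # set check bits so the index-XOR becomes 0
        code[1 << k] = (syndrome >> k) & 1
    code[0] = sum(code) & 1         # overall parity bit: total number of 1s is even
    return code


def decode(code):
    """Return (status, codeword), correcting one flipped bit or flagging two."""
    syndrome = index_xor(code)
    odd_parity = sum(code) & 1
    fixed = list(code)
    if odd_parity:                  # odd number of flips: assume exactly one
        fixed[syndrome] ^= 1        # syndrome 0 means the parity bit itself flipped
        return "corrected", fixed
    if syndrome:                    # even parity but non-zero index-XOR: two flips
        return "double-error", fixed
    return "ok", fixed


def extract(code):
    """Recover the data bits from a (corrected) codeword."""
    return [code[i] for i in data_positions(len(code))]
```

With `r=4` this stores 11 data bits in a 16-bit codeword. Three or more flips can still be miscorrected, which is the usual SECDED limitation.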

rtpg, over 2 years ago

I do wonder how many bits in RAM really are "harmlessly flippable". If I took a snapshot of a running machine, how likely is a flip to land somewhere bad? Perhaps a lot of stuff ends up being write-only, so it's fine?

dale_glass, over 2 years ago

I'm not sure how useful this is, because memory interacts with pretty much everything.

I mean, great: you've validated that the important financial data you were going to write to the DB is correct. But you didn't validate that the OS itself is in full working order. A bit goes out of place, the kernel writes something weird to disk, filesystem becomes corrupted and things explode in a dramatic fashion.

That's exactly why I try to get ECC everywhere these days. I had an old box serving firewall duty until one day it died because it got bumped, a memory module got loose somehow and the resulting disk corruption rendered it unbootable. Applications verifying that their data is correct wouldn't have changed anything.

jedisct1, over 2 years ago

GitHub mirror, since there doesn't seem to be a proper tarball: https://github.com/jedisct1/libsecded

This also adds a cross-platform build script.

segfaultbuserr, over 2 years ago

Similar software error-checking techniques are often used in embedded systems. External electromagnetic interference can corrupt the program counter, registers, and memory, but hardening the hardware is often prohibitively expensive. When the reliability requirements are not too high, redundant software checks are often a solution - the goal is not to eliminate all failures, but to reduce their probability.

The now-deleted (due to lack of citations) Wikipedia article Immunity-aware programming [0] was a good overview of this topic. Relevant techniques include:

> Token passing: Every function is tagged with a unique function ID. When the function is called, the function ID is saved in a global variable. The function is only executed if the function ID in the global variable and the ID of the function match. If the IDs do not match, an instruction pointer error has occurred, and specific corrective actions can be taken. [...] This is essentially an "arm / fire" sequencing, for every function call. Requiring such a sequence is part of safe programming techniques, as it generates tolerance for single bit (or in this case, stray instruction pointer) faults.

> Data duplication: To cope with corruption of data, multiple copies of important registers and variables can be stored. Consistency checks between memory locations storing the same values, or voting techniques, can then be performed when accessing the data. [...] When the data is read out, the two sets of data are compared. A disturbance is detected if the two data sets are not equal. An error can be reported. If both sets of data are corrupted, a significant error can be reported and the system can react accordingly.

> [...] CRCs are calculated before and after transmission or duplication, and compared to confirm that they are equal. A CRC detects all one- or two-bit errors, all odd errors, all burst errors if the burst is smaller than the CRC, and most of the wide-burst errors.

> Parity checks can be applied to single characters (VRC, vertical redundancy check), resulting in an additional parity bit or to a block of data (LRC, longitudinal redundancy check), issuing a block check character. Both methods can be implemented rather easily by using an XOR operation. A trade-off is that less errors can be detected than with the CRC. Parity Checks only detect odd numbers of flipped bits. The even numbers of bit errors stay undetected. A possible improvement is the usage of both VRC and LRC, called Double Parity or Optimal Rectangular Code (ORC).

> Function parameter duplication: Parameters passed to procedures, as well as return values, are considered to be variables. Hence, every procedure parameter is duplicated, as well as the return values. A procedure is still called only once, but it returns two results, which must hold the same value. The source listing to the right shows a sample implementation of function parameter duplication.

> Test/branch duplication: To duplicate a [if-else] test at multiple locations in the program. [...] For every conditional test in the program, the condition and the resulting jump should be reevaluated, as shown in the figure. Only if the condition is met again, the jump is executed, else an error has occurred.

None of the mainstream compilers have these features; programmers often do all of these tasks by hand (!) in C. If someone implemented these kinds of features in GCC or LLVM/clang (similar to how buffer overflow exploits are mitigated by automatic stack canary or Control-Flow Integrity checks), it would be a major contribution to the entire world of embedded system development.

[0] https://web.archive.org/web/20180519034600/https://en.wikipedia.org/wiki/Immunity-aware_programming
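As a rough illustration of the "data duplication" technique with voting, here is a minimal sketch that keeps three copies of a value and takes a majority vote on every read. The class and exception names are hypothetical, and Python is used only to show the logic: in real embedded C the copies would live in separate memory locations, which Python's object model does not guarantee.

```python
class CorruptionError(Exception):
    """Raised when no two stored copies agree."""


class TripleStored:
    """Keeps three copies of a value and majority-votes on every read."""

    def __init__(self, value):
        self._copies = [value, value, value]

    def write(self, value):
        self._copies = [value, value, value]

    def read(self):
        a, b, c = self._copies
        if a == b or a == c:
            winner = a
        elif b == c:
            winner = b
        else:
            # All three copies disagree: report instead of guessing.
            raise CorruptionError("no majority among stored copies")
        # Scrub: rewrite all copies so a single fault does not accumulate.
        self._copies = [winner, winner, winner]
        return winner
```

A single corrupted copy is outvoted and repaired on the next read; if all three copies differ, the error is reported so the system can react, as the quoted text suggests.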

CalChris, over 2 years ago

Doesn't the LPDDR5 in the M2 support ECC? I believe it corrects errors but doesn't report them, no?

throwaway81523, over 2 years ago

For a large array maybe you are better off with e.g. a Reed-Solomon code instead of a Hamming code.

Libsecded
