TechEcho
Random bit-flip invalidates certificate transparency log – again?

141 points by nickf, about 2 years ago

18 comments

gorgoiler, about 2 years ago

The numbers from this Google SIGMETRICS '09 paper are my usual benchmark for thinking about ECC DIMMs:

https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

Their metric of "25,000 to 70,000 errors per billion device hours per megabit" is a bit hard to grapple with. If you assume each error is a single bit, then that's 20 to 50 bytes per GB DIMM per month, or one bit per GB every two hours.
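The unit conversion in this comment can be sanity-checked in a few lines of Python, assuming a 730-hour month and one bit per error:

```python
HOURS_PER_MONTH = 730
MBIT_PER_GB = 8 * 1024  # 8,192 megabits in one gigabyte

# "25,000 to 70,000 errors per billion device hours per megabit"
for rate in (25_000, 70_000):
    errors_per_gb_hour = rate * MBIT_PER_GB / 1e9
    bytes_per_month = errors_per_gb_hour * HOURS_PER_MONTH / 8
    print(f"{rate}: ~{bytes_per_month:.0f} bytes/GB/month, "
          f"one bit every {1 / errors_per_gb_hour:.1f} hours")
```

This gives roughly 19 to 52 bytes per GB per month, matching the "20 to 50 bytes" figure; "one bit every two hours" matches the upper end of the range (the lower end works out to about one bit every five hours).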
TanjB, about 2 years ago

DRAM is not greatly affected by radiation, because the capacitors are large structures relative to radiation events. SRAM is affected, which is why SRAM arrays should always use SECDED ECC.

The dominant cause of DRAM failures is bit flips from variable retention time (VRT), where the cell fails to hold charge long enough to meet refresh timing. These are believed to be caused by stray charge trapped in the gate dielectric, a bit like an accidental NAND cell, and they can persist for days to months. This is why the latest generations (LPDDR4X, LP/DDR5) have single-bit correction built into the DRAM chip. Along with permanent single-cell failures due to aging, this probably fixes more than 95% of DRAM faults.

The DRAM vendors could do a lot better on publishing error statistics. They are probably the least transparent critical technology used in everything, but no regulation requires them to explain, and they generally refuse to share fault statistics even with major customers (which is why folks like AMD run large experiments at supercomputer sites to investigate, and most clouds gather their own data).

That said, DRAM chips are pretty good. The DDR4 generation probably had better than a 1,000 FIT rate per 2 GB chip, so a laptop with 16 GB would have seen fewer than 10 errors per million hours, or under 1 per 50 laptops used for a year.

For many of us the vast majority of data is in media files, and I personally notice broken photos and videos every now and then. I would love to have a laptop with a competent ECC level, but they do not exist; even desktop servers often come without. It is unclear how much better the LP/DDR5 generation will be, since the on-die ECC still does not fix higher-order faults in word lines and other shared structures, which may sum to as much as 10% of aging faults. All simply educated guesses, since the industry will not publish.
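TanjB's FIT arithmetic checks out; the sketch below assumes a 16 GB laptop built from eight 2 GB chips, and the "1 per 50 laptops" figure only works out if a laptop accumulates roughly 2,500 powered hours per year rather than running 24/7 (an assumption, not stated in the comment):

```python
FIT_PER_CHIP = 1000   # failures per 1e9 device-hours, per 2 GB chip
CHIPS = 16 // 2       # a 16 GB laptop built from eight 2 GB chips

fit_total = FIT_PER_CHIP * CHIPS             # 8,000 FIT for the laptop
errors_per_million_hours = fit_total / 1000  # 8.0 -> "fewer than 10"
hours_per_error = 1e9 / fit_total            # 125,000 device-hours per error

# At ~2,500 hours of use per laptop per year, 50 laptops accumulate
# 125,000 hours: about one error per year across the fleet.
laptops_per_error_year = hours_per_error / 2_500
print(errors_per_million_hours, laptops_per_error_year)
```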
amluto, about 2 years ago

It seems to me that CT (or its operators?) should take a lesson from adversarial blockchains (cryptocurrency) here: a new state should not be propagated without verification.

I think that, for CT, this should be fairly straightforward. Some machine with access to the signing keys should generate new nodes and signatures and push those internally to some front-end machines. The latter (on separate physical machines) should fully validate the result before propagating it any further. No outside user sees the result until at least, say, 3 machines fully validate it. Then, if validation fails, the state could be rolled back internally.

When a rollback occurs, the signing machine would think it's signing a new, conflicting state, but that's fine: no one outside the log operator has seen the old conflicting state.
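A minimal sketch of the propagation rule amluto describes; the function name, callbacks, and quorum parameter are invented for illustration, and real CT log implementations differ:

```python
def publish_tree_head(new_head: bytes, sign, validators, quorum: int = 3):
    """Release a signed tree head to clients only after `quorum`
    independent machines have re-validated it.

    `sign` runs on the machine holding the signing keys; `validators`
    are checks run on separate front-end machines.
    """
    signed = sign(new_head)
    approvals = sum(1 for validate in validators if validate(signed))
    if approvals >= quorum:
        return signed  # safe to propagate externally
    # Validation failed: roll back internally. No outside client has
    # seen this head, so signing a conflicting successor is harmless.
    return None
```

The key property is the one the comment identifies: a rollback is only safe because nothing outside the operator ever observed the bad state.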
nickf, about 2 years ago

This happened a couple of years ago too, and the problem seems to have repeated itself: https://news.ycombinator.com/item?id=27728287
benlivengood, about 2 years ago

Do some folks still not validate new entries in their certificate transparency logs on at least one other machine before publishing them?

This is getting to the point (2 log failures in just under 2 years) that I wouldn't be surprised to see some certificates invalidated because they only used 2 transparency logs and both failed within the lifetime of the cert.
chunk_waffle, about 2 years ago

We don't take bit flips seriously enough: practically every consumer device uses non-ECC memory, and very few folks use filesystems (e.g. ZFS) that can detect corrupt blocks. Even when using those things together, it's still not perfect.

Everything is terrible.
Thorrez, about 2 years ago

> Unfortunately, it is not possible for the log to recover from this.

That sounds bad. What does this mean? Does the entire log need to be thrown out, and a new log created from scratch?
peter_d_sherman, about 2 years ago

If there is a software remedy (or at least amelioration) for the hardware problem of random bit flips affecting a software data structure, it might involve creating multiple (2 or more) redundant copies of the data structure in memory and checking each one for consistency against the others at specific intervals.

If the random bit flips affect code, and the code is deterministic, then one amelioration might be running multiple copies of the same code, loaded at different memory locations, where the results of one copy's calculations are checked against the results of the same calculations executed from a different memory location.

Kludgy? Yes. But if the underlying hardware is buggy (random bit errors which cannot be removed for whatever reason), it may be the only effective way to make the system work, despite the kludginess.

Which brings up a strictly academic question: what would an OS look like where every OS data structure and code path was replicated, with the results of each redundant code path and data structure checked at various intervals?

(I know NASA did something like this a long time ago, using five redundant computers where each computer checks the results of the group's computation; if there's an inconsistency, the computer producing the inconsistent result is shut down.)

Related: https://history.nasa.gov/computers/Ch5-5.html
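The redundant-execution idea described here is essentially software triple modular redundancy (TMR). A toy sketch of the voting step, with a deliberately corrupted replica standing in for a bit flip:

```python
from collections import Counter

def vote(results):
    """Majority-vote over redundant runs of the same deterministic code."""
    tally = Counter(results)
    value, count = tally.most_common(1)[0]
    if count < len(results):
        # A real system would take the disagreeing replica offline,
        # as in the NASA design mentioned in the comment.
        print("disagreement detected:", dict(tally))
    return value

# Three copies of the same computation; one result is corrupted to
# simulate a bit flip in that replica's memory.
runs = [sum(range(1000)), sum(range(1000)), sum(range(1000)) ^ (1 << 7)]
print(vote(runs))
```

Note this only masks faults in the replicated computation itself; a flip in the voter, or in shared input data, defeats the scheme, which is why hardware TMR designs also replicate the voter.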
aeaa3, about 2 years ago

Do we know whether the machine on which this occurred had ECC memory?
dboreham, about 2 years ago

My lifetime experience says to suspect software did this (I've had careers in both hardware design, including large memory subsystems, and software development). Yes, it's one bit changed, which makes the mind go to the ever-present alpha particle, but code also flips single bits. If some library code inside the process generating this data wanted to update a bitmap structure but got the address wrong, you'd have the same outcome.
molticrystal, about 2 years ago

Reminds me of bitsquatting, where cosmic rays, hardware faults, or other errors flip a bit in a domain name, and you can gain an advantage by registering the bit-flipped domains.

> Over the course of about seven months, 52,317 requests were made to the bitsquat domains [0]

[0] https://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinaburg_Bitsquatting_WP.pdf
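For illustration, enumerating the single-bit-flip neighbours of a domain label is straightforward (this sketch only keeps flips that land on valid lowercase hostname characters; Dinaburg's paper also considers flips occurring in other parts of a packet):

```python
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-")

def bitflip_variants(label: str):
    """All labels reachable by flipping one bit of one character,
    keeping only valid lowercase hostname characters."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped != ch and flipped in ALLOWED:
                variants.add(label[:i] + flipped + label[i + 1:])
    return sorted(variants)

print(bitflip_variants("cnn"))  # e.g. "bnn", "con", "cnf", ...
```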
Animats, about 2 years ago

This is perhaps one of the few legitimate use cases for a distributed blockchain: several nodes have to agree for the chain to advance.
politelemon, about 2 years ago

It's on the 4th line, where it says

00000030: 9126 9384 ....

instead of 9284 ....
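The two bytes quoted here (0x93 observed, 0x92 expected) do indeed differ by exactly one bit, which is easy to confirm:

```python
observed, expected = 0x93, 0x92
diff = observed ^ expected
assert bin(diff).count("1") == 1               # exactly one bit differs
print(f"bit {diff.bit_length() - 1} flipped")  # the lowest-order bit
```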
ransackdev, about 2 years ago

Relevant video from Veritasium on cosmic rays flipping bits and causing chaos: https://www.youtube.com/watch?v=AaZ_RSt0KP8
bombcar, about 2 years ago

With things like RAID and Reed-Solomon codes, we have the ability to have verifiably correct data even with some percentage lost; how come something like this isn't used?
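One caveat to this suggestion: RAID-style erasure coding recovers data only when the failed location is known (an erasure), whereas a silent bit flip gives no indication of which block is wrong, so a checksum is still needed to locate the error. The simplest erasure scheme, RAID-5-style XOR parity, can be sketched in a few lines:

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

data = [b"hello", b"world", b"again"]
parity = xor_blocks(data)

# Lose data[1]; rebuild it from the surviving blocks plus parity,
# which works because XOR-ing a block in twice cancels it out.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == b"world"
```

Reed-Solomon generalizes this to tolerate multiple losses, but the locate-the-error problem is the same, which is why filesystems like ZFS pair redundancy with per-block checksums.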
zinekeller, about 2 years ago

Wait, it's DigiCert again? (Previous: https://groups.google.com/a/chromium.org/g/ct-policy/c/PCkKU357M2Q/) Do we have a list of all failed CT logs?
teaearlgraycold, about 2 years ago

Interestingly, Google has a team of engineers dedicated to detecting hardware prone to these kinds of errors by perpetually QA-ing devices in the field. A node can then be replaced before it breaks something important.
sushidev, about 2 years ago

What happened here?