I tested four NVMe SSDs from four vendors – half lose FLUSH'd data on power loss (2022)

262 pointsby whitepoplarover 1 year ago

31 comments

r1chover 1 year ago

We shipped a shader cache in the latest release of OBS and quickly had reports come in that the cached data was invalid. After investigating, the cache files were the correct size on disk but the contents were all zero. On a journaled file system this seems like it should be impossible, so the current guess is that some users have SSDs that are ignoring flushes and experience data corruption on crash / power loss.

评论 #38374837 未加载

评论 #38374760 未加载

评论 #38377215 未加载

评论 #38377295 未加载

评论 #38377380 未加载

ricardobeatover 1 year ago

Misleading headline since after testing eight more drives, none more failed.2/12 is not nearly as dramatic as “half”, and the ones that lost data are the cheap brands as one would expect.

评论 #38374597 未加载

评论 #38375808 未加载

评论 #38378364 未加载

评论 #38375188 未加载

评论 #38376725 未加载

IngvarLynnover 1 year ago

There is a flood of fake SSDs currently, mostly big brands. I've recently purchased counterfeit 1TB. It passes all the tests, performance is ok, it works... except it gets episodes where ioping would be anything between 0.7 ms and 15 seconds, that is under zero load. And these are quality fakes from a physical appearance perspective. The only way I could tell mine was fake is that the official Kingston firmware update tool would not recognize this drive.

评论 #38374430 未加载

评论 #38374604 未加载

评论 #38377310 未加载

评论 #38375360 未加载

评论 #38374341 未加载

评论 #38374753 未加载

kristopolousover 1 year ago

Under long term heavy duty, I've routinely seen cheap modern platter outperform cheap brand name NVME.There's some cost cutting somewhere. The NVMEs can't seem to sustain throughput.It's been pretty disappointing to move I/O bound workloads over and not see notable improvements. The magnitude of data I'm talking about is 500-~3000GBI've only got two NVME machines for what I'm doing so I'll gladly accept that it's coincidentally flaky bus hardware on two machines, but I haven't been impressed except for the first few seconds.I know Everyone says otherwise which is why I brought it up. Someone tell me why I'm crazyEdit: no, I'm not crazy. <a href="https://htwingnut.com/2022/03/06/review-leven-2tb-2-5-sata-ssd/" rel="nofollow noreferrer">https://htwingnut.com/2022/03/06/review-leven-2tb-2-5-sata-s...</a> this is similar to what I'm seeing with Crucial and Adata hardware, almost binary performance

评论 #38377512 未加载

评论 #38377120 未加载

评论 #38380860 未加载

评论 #38377126 未加载

评论 #38387580 未加载

kmxdmover 1 year ago

Writes are completed to the host when they land on the SSD controller, not when written to Flash. The SSD controller has to accumulate enough data to fill its write unit to Flash (the absolute minimum would be a Flash page, typically 16kB). If it waited for the write to Flash to send a completion, the latency would be unbearable. If it wrote every write to Flash as quickly as possible, it could waste much of the drive's capacity padding Flash pages. If a host tried to flush after every write to force the latter behavior, it would end up with the same problem. Non-consumer drives solve the problem with back-up capacitance. Consumer drives do not have this. Also, if the author repeated this test 10 or 100 times on each drive, I suspect that he would uncover a failure rate for each consumer drive. It's a game of chance.

评论 #38378203 未加载

评论 #38378224 未加载

评论 #38378220 未加载

评论 #38378210 未加载

loloquwowndueoover 1 year ago

Twitter yuk, can somebody just post the names of the four tested drives and which passed/failed please?

评论 #38373675 未加载

评论 #38376932 未加载

评论 #38373956 未加载

评论 #38373921 未加载

评论 #38373341 未加载

评论 #38376572 未加载

CoastalCoderover 1 year ago

Does advertising a product as adhering to some standard, but secretly knowing that it doesn't 100%, count as e.g. fraud? I.e., is there any established case law on the matter?I'm thinking of this example, but also more generally USB devices, Bluetooth devices, etc.

评论 #38376262 未加载

评论 #38374232 未加载

评论 #38374176 未加载

评论 #38375096 未加载

评论 #38375126 未加载

评论 #38376765 未加载

评论 #38375693 未加载

评论 #38375196 未加载

handednessover 1 year ago

Previous Discussion: <a href="https://news.ycombinator.com/item?id=30419618">https://news.ycombinator.com/item?id=30419618</a>

sashkover 1 year ago

This is (2022).Wondering if anything changed since the original tests...

评论 #38373348 未加载

arglebargle123over 1 year ago

Meanwhile I'm over here jamming Micron 7450 pros into my work laptop for better sync write performance.I have very little trust in consumer flash these days after seeing the firmware shortcuts and stealth hardware replacements manufacturers resort to to cut costs.

评论 #38379847 未加载

jauntywundrkindover 1 year ago

Losing flushes is obviously bad.I wonder how much perf is on the table in various scenarios when we can give up needing to flush. If you know the drive has some resilience, say, 0.5s of time it can safely writeback during, maybe you can give up flushes (in some cases). How much faster is the app then?It's be neat to see some low-cost improvements here. Obviously in most cases, just get an enterprise drive with supercapa or batteries onboard. But an ATX power rail that has extra resilience from the supply, or an add-in/pass-through 6-pin sata power supercap... that could be useful too.

评论 #38373810 未加载

评论 #38374030 未加载

lxgrover 1 year ago

I guess it's time for `fsync_but_really_actually_sync_it_please(2)` (and the lower level equivalents in SATA, NVMe etc.)?

评论 #38373382 未加载

评论 #38372297 未加载

tripdoutover 1 year ago

Flushing in this case is from the SSDs internal DRAM cache to the actual NAND flash?

评论 #38375031 未加载

评论 #38381435 未加载

nijaveover 1 year ago

It'd be nice if there were a database of known bad/known good hardware to reference. I know there's been some spreadsheets and special purpose like the USB-C cables Benson Leung tested.Especially for consumer hardware on Linux--there's a lot of stuff that "works" but is not necessarily stable long term or that required a lot of hacking on the kernel side to work around issues

AlbertoGPover 1 year ago

Well, yes, but which were those 2 out of 4 vendors?

评论 #38373321 未加载

评论 #38371462 未加载

caycepover 1 year ago

The model I’d be interested in would be the SK Hynix/Solidigm P44 Pro, as that model competes w the Samsung 9xx evo and pro models

dtx1over 1 year ago

I am a bit annoyed that everyone here takes this at face value. There's 0 evidence given, not even the vendors and models are named to confirm this.On a related note I tested 4 DDR5 Ram kits from major vendors - half of them corrupt data when exposed to UV light.

39over 1 year ago

This has always been the case? At least it was a course learning when we wrote our own device drivers for minux, even the controllers on spinning metal fib about flush.

naaskingover 1 year ago

At this point, any storage vendor should be required to pass the Sqlite test suite before they can sell their product.

caycepover 1 year ago

Also…would modern journaling file systems protect against this sort of data loss?

评论 #38376046 未加载

pleoxyover 1 year ago

If you need PLP use an enterprise drive. That's what they're for.

评论 #38375539 未加载

评论 #38379748 未加载

评论 #38374795 未加载

评论 #38375812 未加载

Joel_Mckayover 1 year ago

Cheap drives don't include large dram caches, lack fast SLC areas, and leave off super-capacitors that allow chips to drain buffers during a power-failure."Buy cheap, buy twice" as they say... =)

pajkoover 1 year ago

Without any more information this post is just bullshit. For example, it's not documented how the flushing has been done. On Linux, even issuing 'sync' is not enough: <a href="https://unix.stackexchange.com/questions/98568/difference-between-blockdev-flushbufs-and-sync-on-linux" rel="nofollow noreferrer">https://unix.stackexchange.com/questions/98568/difference-be...</a>The bottom answer especially states that "blockdev --flushbufs may still be required if there is a large write cache and you're disconnecting the device immediately after"The hpdarm utility has a parameter for syncing and flushing device buffers themselves. Seems like all three should be done for a complete flush at all levels.

nik736over 1 year ago

That's what PLP is for.

tw1984over 1 year ago

Don't use home-grade SSDs for storing anything that is considered critical.The rule is not that hard to remember.

martincmartinover 1 year ago

2022

jbverschoorover 1 year ago

Brands please. It’s time they have some pressure to fix these data corruption issues

xbmcuserover 1 year ago

this is from Feb 2022

java-manover 1 year ago

Name the offenders please.I am sure it might be easy to see visually - a lack of substantial capacitor on the board would indicate a high likelihood of data loss.

评论 #38371456 未加载

评论 #38371989 未加载

评论 #38372320 未加载

babbermanover 1 year ago

That is unfortunate, but I guess those SSDs performed really well and outclassed all others in performance benchmarks? lol

hackerfactor1over 1 year ago

The posting is from Feb 2022, nearly 2 years ago. How is this suddenly trending on Hacker News?

评论 #38373776 未加载

31 comments

r1chover 1 year ago

评论 #38374837 未加载

评论 #38374760 未加载

评论 #38377215 未加载

评论 #38377295 未加载

评论 #38377380 未加载

ricardobeatover 1 year ago

Misleading headline since after testing eight more drives, none more failed.2/12 is not nearly as dramatic as “half”, and the ones that lost data are the cheap brands as one would expect.

评论 #38374597 未加载

评论 #38375808 未加载

评论 #38378364 未加载

评论 #38375188 未加载

评论 #38376725 未加载

IngvarLynnover 1 year ago

评论 #38374430 未加载

评论 #38374604 未加载

评论 #38377310 未加载

评论 #38375360 未加载

评论 #38374341 未加载

评论 #38374753 未加载

kristopolousover 1 year ago

评论 #38377512 未加载

评论 #38377120 未加载

评论 #38380860 未加载

评论 #38377126 未加载

评论 #38387580 未加载

kmxdmover 1 year ago

评论 #38378203 未加载

评论 #38378224 未加载

评论 #38378220 未加载

评论 #38378210 未加载

loloquwowndueoover 1 year ago

Twitter yuk, can somebody just post the names of the four tested drives and which passed/failed please?

评论 #38373675 未加载

评论 #38376932 未加载

评论 #38373956 未加载

评论 #38373921 未加载

评论 #38373341 未加载

评论 #38376572 未加载

CoastalCoderover 1 year ago

评论 #38376262 未加载

评论 #38374232 未加载

评论 #38374176 未加载

评论 #38375096 未加载

评论 #38375126 未加载

评论 #38376765 未加载

评论 #38375693 未加载

评论 #38375196 未加载

handednessover 1 year ago

Previous Discussion: <a href="https://news.ycombinator.com/item?id=30419618">https://news.ycombinator.com/item?id=30419618</a>

sashkover 1 year ago

This is (2022).Wondering if anything changed since the original tests...

评论 #38373348 未加载

arglebargle123over 1 year ago

评论 #38379847 未加载

jauntywundrkindover 1 year ago

评论 #38373810 未加载

评论 #38374030 未加载

lxgrover 1 year ago

I guess it's time for `fsync_but_really_actually_sync_it_please(2)` (and the lower level equivalents in SATA, NVMe etc.)?

评论 #38373382 未加载

评论 #38372297 未加载

tripdoutover 1 year ago

Flushing in this case is from the SSDs internal DRAM cache to the actual NAND flash?

评论 #38375031 未加载

评论 #38381435 未加载

nijaveover 1 year ago

AlbertoGPover 1 year ago

Well, yes, but which were those 2 out of 4 vendors?

评论 #38373321 未加载

评论 #38371462 未加载

caycepover 1 year ago

The model I’d be interested in would be the SK Hynix/Solidigm P44 Pro, as that model competes w the Samsung 9xx evo and pro models

dtx1over 1 year ago

39over 1 year ago

This has always been the case? At least it was a course learning when we wrote our own device drivers for minux, even the controllers on spinning metal fib about flush.

naaskingover 1 year ago

At this point, any storage vendor should be required to pass the Sqlite test suite before they can sell their product.

caycepover 1 year ago

Also…would modern journaling file systems protect against this sort of data loss?

评论 #38376046 未加载

pleoxyover 1 year ago

If you need PLP use an enterprise drive. That's what they're for.

评论 #38375539 未加载

评论 #38379748 未加载

评论 #38374795 未加载

评论 #38375812 未加载

Joel_Mckayover 1 year ago

Cheap drives don't include large dram caches, lack fast SLC areas, and leave off super-capacitors that allow chips to drain buffers during a power-failure."Buy cheap, buy twice" as they say... =)

pajkoover 1 year ago

nik736over 1 year ago

That's what PLP is for.

tw1984over 1 year ago

Don't use home-grade SSDs for storing anything that is considered critical.The rule is not that hard to remember.

martincmartinover 1 year ago

2022

jbverschoorover 1 year ago

Brands please. It’s time they have some pressure to fix these data corruption issues

xbmcuserover 1 year ago

this is from Feb 2022

java-manover 1 year ago

Name the offenders please.I am sure it might be easy to see visually - a lack of substantial capacitor on the board would indicate a high likelihood of data loss.

评论 #38371456 未加载

评论 #38371989 未加载

评论 #38372320 未加载

babbermanover 1 year ago

That is unfortunate, but I guess those SSDs performed really well and outclassed all others in performance benchmarks? lol

hackerfactor1over 1 year ago

The posting is from Feb 2022, nearly 2 years ago. How is this suddenly trending on Hacker News?

评论 #38373776 未加载