> We wrote a one-off tool to go through our GCS buckets and delete all the files that didn’t have an entry in the manifest.

...

> Specifically, we had to ensure two things — every file removed from the manifest has an entry in BadgerDB, and no file locator that resides in BadgerDB is present in the manifest.

This feels a little weird to read, because I've been working with things that are moving in the exact opposite direction right now.

We had Hadoop S3Guard[1], which kept track of deleted files in a bucket so that we could detect a delete + read race before S3 had strong consistency. That state was stored in DynamoDB, which is very much a KV store, and keeping track of it was somewhat of a nightmare.

We're now moving on to Apache Iceberg, which has roughly the same design discussed here (file-based manifests + orphaned files for failed commits), and we're going there because storing data in a standalone metadata service is getting a bit old-tech (Hive ACIDv2 keeps this info as number sequences, which is very Postgres-like, but needs an FS listing to start off).

So bit by bit, we're moving from a KV store to file manifests to make systems more scalable. In that context, a problem like this exists very clearly in my future, and I wonder if there's a better way to prevent it than going back to a KV-store model again (particularly when the manifests can fork into a tree thanks to data sharing with snapshots).

[1] - https://www.slideshare.net/hortonworks/s3guard-whats-in-your-consistency-model/12
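
For what it's worth, the two-way invariant quoted at the top is straightforward to spell out as a check. Here's a minimal Go sketch, assuming the manifest can be loaded as a set of file locators and BadgerDB holds one key per pending-delete locator; the function name, key layout, and sample data are my own illustration, not the article's actual code:

    // verifyDeletes checks both invariants from the quoted passage:
    //  1. every locator removed from the manifest has an entry in BadgerDB, and
    //  2. no locator recorded in BadgerDB is still present in the manifest.
    package main

    import (
    	"fmt"
    	"log"

    	badger "github.com/dgraph-io/badger/v4"
    )

    func verifyDeletes(db *badger.DB, prevManifest, currManifest map[string]bool) error {
    	// Invariant 1: removed from manifest => recorded in BadgerDB.
    	err := db.View(func(txn *badger.Txn) error {
    		for locator := range prevManifest {
    			if currManifest[locator] {
    				continue // still live, nothing to check
    			}
    			if _, err := txn.Get([]byte(locator)); err == badger.ErrKeyNotFound {
    				return fmt.Errorf("locator %q removed from manifest but has no delete entry", locator)
    			} else if err != nil {
    				return err
    			}
    		}
    		return nil
    	})
    	if err != nil {
    		return err
    	}

    	// Invariant 2: recorded in BadgerDB => absent from the current manifest.
    	return db.View(func(txn *badger.Txn) error {
    		opts := badger.DefaultIteratorOptions
    		opts.PrefetchValues = false // keys alone are enough for this check
    		it := txn.NewIterator(opts)
    		defer it.Close()
    		for it.Rewind(); it.Valid(); it.Next() {
    			locator := string(it.Item().Key())
    			if currManifest[locator] {
    				return fmt.Errorf("locator %q is pending deletion but still referenced by the manifest", locator)
    			}
    		}
    		return nil
    	})
    }

    func main() {
    	db, err := badger.Open(badger.DefaultOptions("/tmp/deletes-db"))
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer db.Close()

    	// Hypothetical manifests: "bucket/b" was removed, so it should have a delete entry.
    	prev := map[string]bool{"bucket/a": true, "bucket/b": true}
    	curr := map[string]bool{"bucket/a": true}
    	if err := verifyDeletes(db, prev, curr); err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println("both invariants hold")
    }

The part that doesn't translate cleanly to the Iceberg-style world is the prevManifest/currManifest diff: once manifests can fork into a tree of snapshots, "removed from the manifest" stops being a single linear comparison, which is exactly the problem I'm wondering about.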