Saving $30k a month by improving garbage collection

56 points by i0exception, almost 4 years ago

2 comments

gopalv, almost 4 years ago

> We wrote a one-off tool to go through our GCS buckets and delete all the files that didn’t have an entry in the manifest.

...

> Specifically, we had to ensure two things: every file removed from the manifest has an entry in BadgerDB, and no file locator that resides in BadgerDB is present in the manifest.

This feels a little weird to read, because I've been working with things which are moving in the exact opposite direction right now.

So we had Hadoop S3Guard [1], which kept track of deleted files in a bucket so that we would detect a delete + read race before S3 had strong consistency. That's stored in DynamoDB, which is very much a KV store, and it was somewhat of a nightmare to keep track of these things.

We're now moving on to Apache Iceberg, which has roughly the same design discussed here (file-based manifests + orphaned files for failed commits), and we're going there because storing data in a standalone metadata service is getting a bit old-tech (Hive ACIDv2 keeps this info as number sequences, which is very Postgres-like, but needs an FS listing to start off).

So bit by bit, we're moving from a KV store to file manifests to make systems more scalable. In that context, a problem like this exists very clearly in my future, and I wonder if there's a better way to prevent it than going back to a KV-store model again (particularly when the manifests can fork into a tree thanks to data-sharing with snapshots).

[1] https://www.slideshare.net/hortonworks/s3guard-whats-in-your-consistency-model/12
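To make the two quoted invariants concrete, here is a minimal sketch using plain Python sets. It is not Mixpanel's implementation: the function names and the in-memory sets standing in for the manifest, the BadgerDB entries, and the bucket listing are all hypothetical, chosen only to illustrate the checks.

```python
# Hypothetical sketch: sets stand in for the manifest, the BadgerDB
# entries ("tombstones"), and the bucket listing. None of these names
# come from the article.

def orphaned_files(bucket_files, manifest):
    """Files present in the bucket but absent from the manifest
    (what the one-off cleanup tool would delete)."""
    return set(bucket_files) - set(manifest)


def invariant_violations(removed_from_manifest, tombstones, manifest):
    """Check the two quoted invariants and return the offending files.

    missing:     removed from the manifest but with no tombstone entry
    resurrected: has a tombstone entry yet is still in the manifest
    """
    missing = set(removed_from_manifest) - set(tombstones)
    resurrected = set(tombstones) & set(manifest)
    return missing, resurrected


if __name__ == "__main__":
    manifest = {"a", "b"}
    bucket = {"a", "b", "c", "d"}
    removed = {"c", "d"}
    tombstones = {"c"}

    print(sorted(orphaned_files(bucket, manifest)))
    print(invariant_violations(removed, tombstones, manifest))
```

Both invariants hold exactly when both returned sets are empty. At real bucket sizes the listings would need to be streamed or sharded rather than held in memory.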
ckdarby, almost 4 years ago

Midway through, I realized they're reinventing already-solved problems.

Mixpanel should take a look at Apache Iceberg for writes, and at Apache Pulsar to keep costs lower by not needing 7-day retention in the pipeline once the messages are acked by all consumers.

For replaying, you can use Trino to read from your Iceberg tables and insert back into your stream.