Binance built a 100PB log service with Quickwit

228 points | by samber | 10 months ago

18 comments

KaiserPro 10 months ago
A word of caution here: this is very impressive, but almost entirely wrong for your organisation.

Most log messages are useless 99.99% of the time. The best likely outcome is that they get turned into a metric. The once-in-a-blue-moon outcome is that they tell you what went wrong when something crashed.

Before you get to shipping _petabytes_ of logs, you really need to start thinking in metrics. Yes, you should log errors, and you should also make sure they are stored centrally and are searchable.

But logs shouldn't be your primary source of data; metrics should be.

Things like connection time, upstream service count, memory usage, transactions per second, failed transactions, and upstream/downstream endpoint health should all be metrics emitted by your app (or hosting layer) directly. Don't try to derive them from structured logs. That's fragile, slow, and fucking expensive.

Comparing, cutting, and slicing metrics across processes or even services is simple; with logs it's not.
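A minimal sketch of what emitting those values directly as metrics might look like, using Python's prometheus_client (the metric names, labels, and the call_upstream helper are illustrative, not anything from the article):

    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    # Illustrative metrics only; real names and labels depend on your service.
    FAILED_TXNS = Counter("failed_transactions_total", "Failed transactions", ["upstream"])
    CONNECT_TIME = Histogram("upstream_connect_seconds", "Time to connect to an upstream")
    MEMORY_BYTES = Gauge("process_memory_bytes", "Resident memory of this process")

    def handle_transaction(upstream):
        with CONNECT_TIME.time():           # records connection latency into a histogram
            ok = call_upstream(upstream)    # hypothetical business call
        if not ok:
            FAILED_TXNS.labels(upstream=upstream).inc()

    start_http_server(9100)  # metrics are scraped directly, not derived from log lines

Slicing these by label across services is then a cheap query in any metrics backend, rather than a scan over structured logs.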
ZeroCool2u 10 months ago
There was a time at the beginning of the pandemic when my team was asked to build a full-text search engine on top of a bunch of SharePoint sites in under 2 weeks, with frustratingly severe infrastructure constraints (no cloud services, a single on-prem box for processing, among other things). We did, and it served its purpose for a few years. Absolutely no one should emulate what we built, but it was an interesting puzzle to work on, and we were able to quickly cut through a lot of bureaucracy that had held us back for years with respect to accessing the sensitive data they needed to search.

But I was always looking for other options for rebuilding the service within those constraints, and found Quickwit when it was under active development. I really admire their work ethic and their engineering. Beautifully simple software that tends to Just Work™. It's also one of the first projects that made me really understand people's appreciation for Rust beyond just loving Cargo.
randomtoast 10 months ago
I wonder how much their setup costs. Naively, if one were to simply feed 100 PB into Google BigQuery without any further engineering efforts, it would cost about 3 million USD per month.
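The back-of-the-envelope behind that kind of number, for reference (the per-GB price is an assumed ballpark, not a quote):

    # Naive estimate only; the $/GB-month rate is an assumption, and real BigQuery
    # pricing varies by storage tier, region, and compression.
    volume_gb = 100 * 1_000_000        # 100 PB expressed in GB
    price_per_gb_month = 0.03          # assumed active-storage-style rate

    monthly_cost = volume_gb * price_per_gb_month
    print(f"~${monthly_cost / 1e6:.1f}M per month")   # ~$3.0M per month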
piterrro 10 months ago
Reminds me of the time Coinbase paid Datadog $65M for storing logs [1].

[1] https://thenewstack.io/datadogs-65m-bill-and-why-developers-should-care/
AJSDfljff 10 months ago
Unfortunately the interesting part is missing.

It's not hard at all to scale to petabytes. Chunk your data based on time and scale horizontally. When you can scale horizontally, it doesn't matter how much it is.

Elasticsearch is not something I would use for horizontally scaling basic logs; I would use it for live data that I need with low latency, or if I were constantly doing a lot of live log analysis.

Did Binance really need Elasticsearch, or did they just start pushing everything into it without ever looking left and right?

Did they do any log processing and cleanup before?
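A tiny illustration of the time-chunking idea: route each record to a time-bucketed prefix, so retention becomes dropping whole old chunks rather than deleting documents inside one giant index (the path layout and hourly granularity are invented for the example):

    from datetime import datetime, timezone

    def partition_key(service: str, ts: float) -> str:
        """Map a log record onto an hourly bucket, e.g. logs/payments/2024/07/12/14/."""
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        return f"logs/{service}/{dt:%Y/%m/%d/%H}/"

    print(partition_key("payments", 1720794600.0))  # logs/payments/2024/07/12/14/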
endorphine 10 months ago
What would you use for storing and querying long-term audit logs (e.g. 6 months retention), which should be searchable with subsecond latency and would serve 10k writes per second?

AFAICT this system feels like a decent choice. Alternatives?
RIMR 10 months ago
I am having trouble understanding how any organization could ever need a collection of logs larger than the entire Internet Archive. 100 PB is staggering, and the idea of filling that with logs, while entirely possible, seems completely useless given the cost of managing that kind of data.

This is quite impressive on a technical level though, don't get me wrong; I just don't understand the use case.
ram_rar 10 months ago
> Limited Retention: Binance was retaining most logs for only a few days. Their goal was to extend this to months, requiring the storage and management of 100 PB of logs, which was prohibitively expensive and complex with their Elasticsearch setup.

Just to give some perspective: the Internet Archive, as of January 2024, claims to have stored roughly 99 petabytes of data.

Can someone from Binance/Quickwit comment on the use case that needed log retention for months? I have rarely seen users try to access actionable _operations_ log data beyond 30 days.

I wonder how much more money they could save by leveraging tiered storage and by engineers being mindful of logging.
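On the tiered-storage point, one common pattern (sketched here with boto3; the bucket name, prefix, day counts, and storage classes are all illustrative) is an S3 lifecycle rule that demotes old log objects to a cold tier and eventually expires them:

    import boto3

    s3 = boto3.client("s3")

    # Illustrative policy: hot for 30 days, cold archive until day 180, then deleted.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-log-archive",          # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "tier-then-expire-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER_IR"}],
                "Expiration": {"Days": 180},
            }]
        },
    )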
kstrauser 10 months ago
How? If Binance had a trillion transactions, that’s 100KB per transaction. What all are they logging?
sebstefan 10 months ago
> On a given high-throughput Kafka topic, this figure goes up to 11 MB/s per vCPU.

There's got to be a 2x to 10x improvement to be made there, no? No way CPU is the limitation these days, and even bad hard drives will sustain 50+ MB/s write speeds.
jiggawatts 10 months ago
The article talks about 1.6 PB/day, which is 150 Gbps of log ingest traffic sustained. That's insane.

A change of the logging protocol to a more efficient format would yield such a huge improvement that it would be much cheaper than this infrastructure engineering exercise.

I suspect that everyone just assumes that the numbers represent the underlying data volume, and that this cannot be decreased. Nobody seems to have heard of *write amplification*.

Let's say you want to collect a metric. If you do this with a JSON document format, you'd likely end up ingesting records like the following made-up example:

    {
      "timestamp": "2024-07-12T14:30:00Z",
      "serviceName": "user-service-ba34sd4f14",
      "dataCentre": "eastFoobar-1",
      "zone": 3,
      "cluster": "stamp-prd-4123",
      "instanceId": "instance-12345",
      "object": "system/foo/blargh",
      "metric": "errors",
      "units": "countPerMicroCentury",
      "value": 391235.23921386
    }

It wouldn't surprise me if in reality this was actually 10x larger. For example, *just* the "resource id" of something in Azure is about this size, and it's just one field of many collected by *every* logging system in that cloud for *every* record. Similarly, I've cracked open the protocols and schema formats for competing systems and found 300x or worse write amplification being the typical case.

The actual data that needed to be collected was just:

    391235.23921386

In a binary format that would be 4 bytes, or 8 if you think you need to draw your metric graphs with a vertical precision of a millionth of a pixel and horizontal precision of a minute because you can't afford the exabytes of storage a higher collection frequency would require.

If you collect 4 bytes per metric in an array and record the start timestamp and the interval, you don't even need a timestamp per entry, just one per thousand or whatever. For a metric collected *every second* that's just 10 MB per month *before* compression. Most metrics change slowly or not at all and would compress down to mere kilobytes.
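A rough sketch of that packed layout versus JSON-per-sample, just to put numbers on it (the record shape and sizes are illustrative):

    import json
    import struct
    import time

    samples = [391235.2 + i for i in range(3600)]   # one hour of per-second values

    # JSON-per-sample style: one verbose record per data point.
    json_bytes = sum(
        len(json.dumps({"timestamp": time.time() + i, "metric": "errors", "value": v}))
        for i, v in enumerate(samples)
    )

    # Packed style: one small header (start time + interval), then 4 bytes per sample.
    header = struct.pack("<dI", time.time(), 1)           # start timestamp, 1-second interval
    payload = struct.pack(f"<{len(samples)}f", *samples)  # float32 per sample
    print(json_bytes, len(header) + len(payload))         # tens of bytes/sample vs ~4 bytes/sample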
elchief 10 months ago
maybe drop the log level from debug to info...
inssein 10 months ago
Why do so many companies insist on shipping their logs via Kafka? I can't imagine delivery-semantics guarantees are necessary for logs, and if they are, should that data even be in your logs?
ukuina 10 months ago
Does Quickwit support regex search now? The underlying store, Tantivy, already does.

This is what stopped a PoC cold at an earlier project.
cletus 10 months ago
Just browsing the Quickwit documentation, it seems like the general architecture here is to ingest JSON logs but store them compressed. Is this just something like gzip compression? A 20% compressed size does seem to align with ballpark estimates of JSON gzip compression. This is what Quickwit (and this page) calls a "document": a single JSON record (just FYI).

Additionally you need to store indices, because this is what you actually search. Indices have a storage cost when you write them too.

When I see a system like this, my thoughts go to questions like:

- What happens when you alter an index configuration? Or add or remove an index?
- How quickly do indexes update when this happens?
- What about cold storage?

Data retention is another issue. Indexes have config for retention [1]. It's not immediately clear to me how document retention works, possibly via S3 expiration?

Also, network transfer from S3 is relatively expensive ($0.05/GB standard pricing [2] to the Internet, less to AWS regions). This will be a big factor in cost. I'm really curious to know how much all of this actually costs per PB per month.

IME you almost never need to log and store this much data, and there's almost no reason to ever store this much. Most logs are useless, and you also have to question what the purpose of any given log is. Even if you're logging errors, you're likely to get the exact same value out of 1% sampling of logs as you are from logging everything.

You might even get more value with 1% sampling, because your querying and monitoring might be a whole lot easier with substantially less data to deal with.

Likewise, metrics tend to work just as well from sampled data.

This post suggests 60-day log retention (100 PB / 1.6 PB daily). I would probably divide this into:

1. Metrics storage. You can get this from logs, but you'll often find it useful to write it directly if you can. Getting it from logs can be error-prone (e.g. a log format changes, the sampling rate changes, and so on);

2. Sampled data, generally for debugging. I would generally try to keep this at 10 TB or less;

3. "Offline" data, which you would generally only query if you absolutely had to. This is particularly true on S3, for example, because the write costs are basically zero but the read costs are expensive.

Additionally, you'd want to think about data aggregation, as a lot of your logs are only useful when combined in some way.

[1]: https://quickwit.io/docs/overview/concepts/indexing

[2]: https://aws.amazon.com/s3/pricing/
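A minimal sketch of what 1% sampling could look like in practice (hash-keyed on a trace/request id so a sampled request keeps all of its lines; the id values are invented for the example):

    import hashlib

    SAMPLE_RATE = 0.01   # keep roughly 1% of log traffic

    def keep(trace_id: str) -> bool:
        """Deterministic sample keyed on trace id: a sampled request keeps
        every one of its log lines, the rest keep none."""
        digest = hashlib.sha256(trace_id.encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
        return bucket < SAMPLE_RATE

    print(sum(keep(f"req-{i}") for i in range(100_000)))  # ~1,000 of 100,000 requests kept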
evdubs 10 months ago
Lots of storage to log all of the wash trading on their platform.
kristopolous 10 months ago
So people don't build this out themselves?

Regardless, there's some computer somewhere serving this. How do they service 1.6 PB per day? Are we talking tape backup? Disks? I've seen those mechanical arms that can pick tapes from a stack on a shelf, is that what is used? (Example: https://www.osc.edu/sites/default/files/press/images/tapelibrary.jpg)

For disks that's roughly ~60/day without redundancy. Do they have people just constantly building out and onlining machines in some giant warehouse?

I assume there's built-in redundancy and it's someone's job to go through and replace failed units?

This all sounds absurdly expensive.

And I'd have to assume they deal with at least 100x that scale, because they have many other customers.

Like what is that? 6,000 disks a day? Really?

I hear these petabyte-scale numbers frequently. I think Facebook is around 5 PB/day. I've never had to deal with anything that large. Back in the colo days I saw a bunch of places, but nothing like that.

I'm imagining forklifts moving around pallets of shrink-wrapped drives that get constantly delivered.

Am I missing something here?

Places like AWS should run tours. It'd be like going to the mint.
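The arithmetic behind the "~60/day" guess, for what it's worth (drive size is an assumption):

    # Back-of-the-envelope: raw drives needed per day, before any redundancy.
    daily_ingest_tb = 1600          # 1.6 PB/day expressed in TB
    drive_size_tb = 24              # assumed large nearline HDD

    print(f"{daily_ingest_tb / drive_size_tb:.0f} drives/day")   # ~67 at these assumptions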
ATsch 10 months ago
It's always very amusing how all of the blockchain companies wax lyrical about the huge supposed benefits of blockchains, and how every industry and company is missing out by not adopting them and should definitely run a hyperledger private blockchain buzzword whatever.

And then, even when faced with implementing a huge, audit-critical, distributed append-only store, the very thing they tell us blockchains are so useful for, they just use normal database tech like the rest of us. With one centralized infrastructure where most of the transactions in the network actually take place. Whose tech stack looks suspiciously like every other financial institution's.

I'm so glad we're ignoring 100 years of securities law to let all of this incredible innovation happen.