The article talks about 1.6 PB / day, which works out to roughly 150 Gbps of sustained log ingest traffic. That's insane.<p>Switching the logging protocol to a more efficient format would yield such a huge improvement that it would be much cheaper than this infrastructure engineering exercise.<p>I suspect that everyone just assumes that the numbers represent the underlying data volume, and that this cannot be decreased. Nobody seems to have heard of <i>write amplification</i>.<p>Let's say you want to collect a metric. If you do this with a JSON document format you'd likely end up ingesting records like the following made-up example:<p><pre><code> {
"timestamp": "2024-07-12T14:30:00Z",
"serviceName": "user-service-ba34sd4f14",
"dataCentre": "eastFoobar-1",
"zone": 3,
"cluster: "stamp-prd-4123",
"instanceId": "instance-12345",
"object": "system/foo/blargh",
"metric: "errors",
"units": "countPerMicroCentury",
"value": 391235.23921386
}
</code></pre>
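As a rough back-of-the-envelope check on that record (Python; the record is the made-up example above, so the numbers are illustrative only):<p><pre><code> import json

 # The made-up record from above, as it would go over the wire.
 record = {
     "timestamp": "2024-07-12T14:30:00Z",
     "serviceName": "user-service-ba34sd4f14",
     "dataCentre": "eastFoobar-1",
     "zone": 3,
     "cluster": "stamp-prd-4123",
     "instanceId": "instance-12345",
     "object": "system/foo/blargh",
     "metric": "errors",
     "units": "countPerMicroCentury",
     "value": 391235.23921386,
 }

 wire_bytes = len(json.dumps(record).encode("utf-8"))  # ~290 bytes on the wire
 payload_bytes = 8                                      # the one float64 that matters
 print(wire_bytes / payload_bytes)                      # ~36x write amplification
</code></pre>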
It wouldn't surprise me if in reality the records were 10x larger than this. For example, <i>just</i> the "resource id" of something in Azure is about the size of this whole example, and it's only one of many fields collected by <i>every</i> logging system in that cloud for <i>every</i> record. Similarly, I've cracked open the protocols and schema formats of competing systems and found 300x or worse write amplification to be the typical case.<p>The actual data that needed to be collected was just:<p><pre><code> 391235.23921386
</code></pre>
In a binary format that's 4 bytes, or 8 if you think you need to draw your metric graphs with a vertical precision of a millionth of a pixel and a horizontal precision of a minute because you can't afford the exabytes of storage a higher collection frequency would require.<p>If you collect 4 bytes per sample in an array and record the start timestamp and the interval, you don't even need a timestamp per entry, just one per thousand samples or so. For a metric collected <i>every second</i> that's about 10 MB per month <i>before</i> compression. Most metrics change slowly or not at all and would compress down to mere kilobytes.
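A minimal sketch of that layout (Python; the field layout and names are my own strawman, not anything from the article):<p><pre><code> import zlib
 from array import array
 from struct import pack

 def pack_series(start_ts: int, interval_s: int, values: list[float]) -> bytes:
     # One series block: 8-byte start timestamp, 4-byte interval,
     # 4-byte sample count, then the samples as packed float32s.
     header = pack("<qII", start_ts, interval_s, len(values))
     return header + array("f", values).tobytes()

 # A month of one metric sampled every second: ~2.6M samples.
 month = pack_series(1720794600, 1, [391235.25] * (30 * 24 * 3600))
 print(len(month))                 # ~10.4 MB raw, matching the estimate above
 print(len(zlib.compress(month)))  # a flat metric compresses down to ~10 KB
</code></pre>
One timestamped header per block is enough; the identity fields (service, cluster, instance, metric name) belong in metadata sent once per stream, not repeated on every record.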