Some of our tools allocate so much that capturing every event to disk (using MTuner) and loading it back later takes hundreds of GB by itself.<p>Instead I've added random sampling keyed on the pointer, e.g. compute ptr % modulo and compare it against a level to decide whether to output the event or not. Since the decision depends only on the address, an allocation and its matching free are either both recorded or both dropped.<p>Another factor that slows things down is callstack capture. It's not free at all: on Windows, for example, it has to go through the exception handlers, etc. I think perfetto simply captures the whole stack (need to check again) and then decodes it offline, or something like that.<p>Windows ETW tracing can also capture the stack, but I'd guess it incurs some penalty too; it can't come for free.<p>I also wish there was some kind of standard binary format for emitting alloc/free sequences with callstacks, etc.
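<p>A rough sketch of the pointer-based sampling idea, in case it helps. This is not MTuner code: `should_sample`, `modulo` and `level` are names made up for illustration, and the bit-mixing step is my own assumption (heap pointers are usually 8- or 16-byte aligned, so a raw ptr % modulo can be badly skewed).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of MTuner): decide whether to record an
// alloc/free event for `ptr`. The result depends only on the address, so the
// allocation and the matching free of one block get the same answer.
inline bool should_sample(const void* ptr, std::uint64_t modulo, std::uint64_t level) {
    assert(modulo != 0);  // avoid division by zero in the modulo below
    auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    // Assumption: mix the bits before taking the modulo, because the low bits
    // of aligned heap pointers are not uniformly distributed.
    std::uint64_t h = static_cast<std::uint64_t>(addr) * 0x9E3779B97F4A7C15ull;
    h ^= h >> 32;
    // Keeps roughly level/modulo of all addresses.
    return (h % modulo) < level;
}
```

<p>With something like modulo = 100 and level = 10 this would keep about 10% of the blocks. One caveat: an address that gets reused after a free is sampled again, so long-lived hot addresses show up in every "generation".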