This was a pleasant surprise to see on Hacker News this morning! I work on the Observability team at Stripe and have been the PM for Veneur (and the rest of our metrics & tracing pipeline work) pretty much since we released it ~2 years ago.

If you're interested in learning more about how Veneur works and why we built it, I gave a talk at Monitorama last year that explains the philosophy behind it [0]. In short, a massive company like Google is able to build its own integrated observability stack in-house, but almost any smaller company is going to rely on an array of open-source tools or third-party vendors for different parts of its observability tooling [1]. When using different tools, there are always going to be gaps between them, which leads to incomplete instrumentation and awkward (inter)operability. By taking control of the pipeline that processes the data, we're able to provide fully integrated views into different aspects of our observability data.

The Monitorama talk is a year old at this point, so it doesn't cover some of the newer things Veneur has helped us accomplish, but the core philosophy hasn't changed. I've given updated versions of the talk more recently at CraftConf (in May) and DevOpsDaysMSP (last week), but neither of those videos is online yet.

[0] https://vimeo.com/221049715

[1] e.g. ELK/Papertrail/Splunk for logs, Graphite/Datadog/SignalFx for metrics, and maybe a third tool for tracing if you're lucky.
Am I the only one who is always slightly disappointed that neither the README on GitHub nor the landing page on the website tells me why I would want to use the software in question? What problem does it solve? Why might "a distributed, fault-tolerant observability pipeline" be interesting to programmers or anyone else? It seems like you already have to be familiar with the problem space to understand what this is and what need it fulfills.

I'm not picking on this package; I see it all the time.

Can someone here explain to me what the use case is for this software?
It's definitely interesting to see the different systems being built for monitoring across the different tech companies.

M3 Aggregator, Uber's metrics aggregation tier, is similar, except it has built-in replication and leader election on top of etcd to avoid any SPOF during deployments, failed instances, etc. It also uses Cormode-Muthukrishnan for estimating percentiles by default, with support for T-Digest too. These days, though, submitting histogram bucket aggregates all the way from the client to the aggregator and then to storage is more popular, since you can estimate percentiles across more dimensions and time windows at query time quite cheaply (see the sketch after the link below). You need to choose your buckets carefully, though.

It too is open source, but needs some help to make it plug into other stacks more easily:
https://github.com/m3db/m3aggregator
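To make the bucket-aggregate approach concrete, here's a rough Go sketch of the query-time interpolation, in the style of Prometheus's histogram_quantile. The bucket bounds and counts are made up for illustration, and this is not M3's actual code:

    package main

    import "fmt"

    // bucket holds an inclusive upper bound and the cumulative count
    // of observations at or below that bound.
    type bucket struct {
        upperBound float64
        cumCount   float64
    }

    // quantile estimates the q-th quantile (0..1) by finding the
    // bucket containing the target rank and interpolating linearly
    // inside it. Accuracy is bounded by how well the fixed bucket
    // bounds match the real distribution -- the "choose your buckets
    // carefully" part.
    func quantile(q float64, buckets []bucket) float64 {
        total := buckets[len(buckets)-1].cumCount
        rank := q * total
        prevBound, prevCount := 0.0, 0.0
        for _, b := range buckets {
            if b.cumCount >= rank {
                inBucket := b.cumCount - prevCount
                if inBucket == 0 {
                    return b.upperBound
                }
                frac := (rank - prevCount) / inBucket
                return prevBound + frac*(b.upperBound-prevBound)
            }
            prevBound, prevCount = b.upperBound, b.cumCount
        }
        return buckets[len(buckets)-1].upperBound
    }

    func main() {
        // Request latencies in ms; counts are cumulative.
        buckets := []bucket{
            {10, 400}, {50, 900}, {100, 980}, {500, 1000},
        }
        fmt.Printf("estimated p95: %.2fms\n", quantile(0.95, buckets))
        // estimated p95: 81.25ms
    }

Because the per-host payloads are just bucket counts, the aggregator can merge them by simple summation, and any percentile over any tag combination or time window falls out at query time.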
When I'm evaluating a system like this, what I want to read about is how it's hardened against client stupidity. For example, someone deploys an application in my datacenter and it emits metrics that have gibberish in their names (consider the common Java bug where a class lacks a toString override, so the metric gets barfed out as foo.bar.0xCAFEBABE.baz). How does the system cope with this enormous, hyper-dimensional input?
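For what it's worth, one defense I've seen is to normalize hash-looking name segments at ingestion, before they can fan out into fresh time series. A purely hypothetical Go sketch, not anything Veneur or M3 documents:

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    // Matches segments that look like memory addresses or default
    // Java hash codes ("0xCAFEBABE", "1a2b3c4d"). Requiring at least
    // one digit (checked below) avoids flagging ordinary words that
    // happen to be all hex letters, e.g. "facade".
    var hexSegment = regexp.MustCompile(`^(0x)?[0-9a-fA-F]{6,16}$`)

    // sanitizeMetricName collapses hash-like segments to a fixed
    // placeholder so one buggy client can't mint a new time series
    // per object instance.
    func sanitizeMetricName(name string) string {
        segments := strings.Split(name, ".")
        for i, seg := range segments {
            candidate := seg
            // Default Object.toString looks like "ClassName@1a2b3c4d";
            // inspect the part after the '@'.
            if at := strings.IndexByte(seg, '@'); at >= 0 {
                candidate = seg[at+1:]
            }
            if hexSegment.MatchString(candidate) &&
                strings.ContainsAny(candidate, "0123456789") {
                segments[i] = "_invalid_"
            }
        }
        return strings.Join(segments, ".")
    }

    func main() {
        fmt.Println(sanitizeMetricName("foo.bar.0xCAFEBABE.baz"))
        // foo.bar._invalid_.baz
    }

The other half of the defense is a cap: once a metric name has produced more than some fixed number of distinct series, drop or sample new ones and emit a counter about it, so the pipeline degrades loudly instead of falling over.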