Alas, we’ve found the first thing I _don’t_ like about Fly. They’ve sadly bought into the TSDB model that everything is a counter instead of the more modern model that everything is a histogram.<p>Google has long since abandoned the Borgmon data model for histograms with Monarch. The closest non-Google implementation is probably Circonus. Sadly, neither is available as open source software.<p>I can’t really blame Fly for not individually building an open source modern metrics DB. But it’s sort of sad that the infra team I’m most impressed with has to use metric systems from 15 years ago when the rest of their stack is so cutting edge.
First time I've heard of Fly, went to check out their docs and found this awesome little mention:<p>> We use Let's Encrypt to issue certificates, and donate half of our SSL fees to them at the end of each calendar year.
> VictoriaMetrics<p>:(<p>Not actually Prometheus-compatible, sloppy code, spotty docs. I have no idea why this dumb product continues to attract users.<p><a href="https://prometheus.io/blog/2021/05/04/prometheus-conformance-remote-write-compliance/" rel="nofollow">https://prometheus.io/blog/2021/05/04/prometheus-conformance...</a><p>> Telegraf, which is to metrics sort of what Logstash is to logs: a swiss-army knife tool that adapts arbitrary inputs to arbitrary output formats. We run Telegraf agents on our nodes to scrape local Prometheus sources, and Vicky scrapes Telegraf. Telegraf simplifies the networking for our metrics; it means Vicky (and our iptables rules) only need to know about one Prometheus endpoint per node.<p>Normally you just use a regular Prometheus server to do this. Why add another, different technology to the stack?<p>> We spent some time scaling it with Thanos, and Thanos was a lot, as far as ops hassle goes.<p>It really isn't -- assuming you're not trying to bend Prometheus into something it isn't. Prometheus works using a federated, pull-based architecture. It expects to be near the things it's monitoring, and expects you to build out a hierarchy of infrastructure, in layers, to handle larger scopes.<p>This is structurally different from what I'll call the "clustering" model of scale, where you have all your data sources pushing their data, aggregating maybe on the machine or datacenter level, but then shuttling everything to a single central place, which you scale vertically from the perspective of your users. This appears to be what you want to do, based on the prevalence of push-based tech in your stack.<p>Prometheus doesn't work this way. Some people <i>really want</i> it to work this way, and have even created entire product lines that make it look as if it works this way (Cortex, M3DB) but it's fundamentally just not how it's designed to be used. If you try to make it work this way yourself, you'll certainly get frustrated.
How big is the Fly eng team at this point? You all seem to be doing a ton, I’m always kind of surprised these posts don’t end with the usual “we’re hiring” blurb that’s become the norm on these sorts of tech posts.
Hadn't heard of promxy before. In the past, to reduce cardinality/deduplicate metrics, I've just run another instance of Prometheus entirely in-memory and used rewrite rules.<p>Exposing a metrics endpoint for customers is nice. How do you manage the cardinality? I haven't used Victoria before; is it just better at high-cardinality time series?
They mentioned that they decided against Thanos for metrics storage, but I'd be curious to hear whether other TSDBs were considered. It is a hot space; I know about M3DB, ClickHouse, Timescale, InfluxDB, QuestDB, OpenTSDB, etc.
> When it comes to automated monitoring, there are two big philosophies, “checks” and “metrics”.<p>There's a third, "events". Just push an event out whenever something interesting happens, and let the monitoring tool decide whether to count, aggregate, histogram, alert, etc.<p>Events require less code in the app (no storage, no aggregation, no web server), and allow more flexibility in processing. I have used events to great effect. I am baffled as to why monitoring people still only talk about metrics.
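For what it's worth, here is a rough sketch of the events idea from the comment above, assuming JSON-encoded events. The names `make_event`, `count_by`, and `bucket_durations` and the `duration_ms` field are made up for illustration, not from any particular tool. The app only describes what happened; counting and histogramming happen later, on the consumer side.

```python
import json
import time
from collections import Counter


def make_event(kind, **fields):
    """App side: describe what happened; no counters, no local aggregation."""
    return json.dumps({"ts": time.time(), "kind": kind, **fields})


def count_by(events, key):
    """Consumer side: decide after the fact to count events by some field."""
    return Counter(json.loads(e).get(key) for e in events)


def bucket_durations(events, bounds_ms):
    """Consumer side: build a cumulative histogram of the hypothetical
    duration_ms field, skipping events that do not carry it."""
    buckets = {b: 0 for b in bounds_ms}
    for e in events:
        d = json.loads(e).get("duration_ms")
        if d is None:
            continue
        for b in bounds_ms:
            if d <= b:
                buckets[b] += 1
    return buckets


events = [
    make_event("http_request", path="/", duration_ms=12),
    make_event("http_request", path="/api", duration_ms=180),
    make_event("deploy", app="example"),
]
by_kind = count_by(events, "kind")
latency = bucket_durations(events, [50, 250])
```

In a real setup the events would presumably be shipped over the network rather than kept in a list, and the trade-off is that the monitoring side has to ingest every event instead of a pre-aggregated number.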
> If you’re an Advent of Code kind of person and you haven’t already written a parser for this format, you’re probably starting to feel a twitch in your left eyelid. Go ahead and write the parser; it’ll take you 15 minutes and the world can’t have too many implementations of this exposition format. There's a lesson about virality among programmers buried in here somewhere.<p>Huh? Who gets excited about writing a parser?<p>What was wrong with "${key} ${value}" on separate lines?
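For anyone curious what the 15-minute parser might look like, here is a minimal sketch in Python. It assumes a simplified subset of the Prometheus text exposition format: `name{label="value",...} value [timestamp]` lines, with `#` comment lines skipped, escape sequences in label values left as-is, and the optional timestamp dropped. The name `parse_exposition` is made up. The labels are the part a plain "${key} ${value}" line would not carry.

```python
import re

# Simplified sample line: name{label="value",...} value [timestamp]
_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# One label pair; the value may contain backslash escapes (not decoded here).
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE/comment lines
        m = _LINE.match(line)
        if m is None:
            raise ValueError(f"unparseable line: {raw!r}")
        labels = dict(_LABEL.findall(m.group("labels") or ""))
        # float() accepts the format's special values such as +Inf and NaN.
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


text = (
    "# HELP http_requests_total Total HTTP requests.\n"
    "# TYPE http_requests_total counter\n"
    'http_requests_total{method="post",code="200"} 1027 1395066363000\n'
    "up 1\n"
)
samples = parse_exposition(text)
```

A real implementation would also need to decode escapes in label values and keep the HELP/TYPE metadata, which this sketch ignores.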
Who’s the target audience for Fly?<p>I’m trying to understand the market they’re operating in. Big ol’ enterprises would probably want to run on AWS/GCP, right? So would startups? What’s the long game? Genuine question.
Not to pick on Fly (seems nice), but on the trend for containers:<p><i>>if you’ve got a Docker container, it can be running on Fly in single-digit minutes.</i><p>I used to laugh at the old Plan 9 fortune, "... Forking an allegro process requires only seconds... -V. Kelly". Guess I'm not laughing anymore?<p>FWIW, performance of components is <i>the</i> barrier to composition in system design and development. You can't compose modules that take <i>seconds</i> to act, and still have something that is usable real-time.
> Fly.io transforms container images into fleets of micro-VMs running around the world on our hardware.<p>Oh boy!<p>> None of us have ever worked for Google, let alone as SREs. So we’re going out on a limb<p>Oh.... boy.<p>> We spent some time scaling it with Thanos, and Thanos was a lot, as far as ops hassle goes.<p>You know, they have these companies now, that will collect your metrics for you, so that you don't have to deal with ops hassle.<p>> In each Firecracker instance, we run our custom init,<p>... in Rust. Yes, the thing that is normally a shell script, is now a compiled program in a new language, that mostly just runs mkdir(), mount() and ethtool(). (<a href="https://github.com/superfly/init-snapshot/blob/public/src/bin/init/main.rs" rel="nofollow">https://github.com/superfly/init-snapshot/blob/public/src/bi...</a>). In a few years, when that component is passed off to a dedicated Ops team, and they find it hard to hire a sysadmin who also knows Rust, there will be some poor intern who learned Rust over the summer whose job is to rewrite that thing back into a shell script.