Fly’s Prometheus Metrics

164 points by elithrar about 4 years ago

16 comments

kasey_junk about 4 years ago

Alas, we’ve found the first thing I _don't_ like about Fly. They’ve sadly bought into the TSDB model that everything is a counter instead of the more modern model that everything is a histogram.

Google has long since abandoned the Borgmon data model for histograms with Monarch. The closest non-Google implementation is probably Circonus. Sadly neither is available as open source software.

I can’t really blame Fly for not individually building an open source modern metric DB. But it’s sort of sad that the infra team I’m most impressed with has to use metric systems from 15 years ago when the rest of their stack is so cutting edge.
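
To make the "everything is a counter" point concrete: in Prometheus, a histogram is exposed as a set of cumulative counter series (`_bucket` with an `le` label, plus `_sum` and `_count`). The sketch below flattens raw observations into that shape. The function name, metric name, and bucket bounds are hypothetical and chosen only for illustration; this is not Fly's or Prometheus's own code.

```python
def histogram_as_counters(name, bounds, observations):
    """Flatten observations into Prometheus-style cumulative bucket series.

    Each bucket counts every observation less than or equal to its
    upper bound, so the series are cumulative, not per-interval.
    """
    series = {}
    for bound in sorted(bounds):
        key = f'{name}_bucket{{le="{bound}"}}'
        series[key] = sum(1 for o in observations if o <= bound)
    # The +Inf bucket always equals the total number of observations.
    series[f'{name}_bucket{{le="+Inf"}}'] = len(observations)
    series[f'{name}_sum'] = sum(observations)
    series[f'{name}_count'] = len(observations)
    return series


if __name__ == "__main__":
    # Hypothetical request latencies in seconds.
    for key, value in histogram_as_counters(
        "rt_seconds", [0.1, 0.5], [0.05, 0.3, 0.7]
    ).items():
        print(key, value)
```

Because the bucket bounds are fixed when the data is recorded, quantiles can only be estimated to bucket resolution afterwards, which is part of the trade-off the comment is pointing at.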

krono about 4 years ago

First time I've heard of Fly, went to check out their docs and found this awesome little mention:

> We use Let's Encrypt to issue certificates, and donate half of our SSL fees to them at the end of each calendar year.

hstaab about 4 years ago

Does anyone here have experience using Fly? I’ve seen a few of their posts and it seems quite nice.

sagichmal about 4 years ago

> VictoriaMetrics

:( Not actually Prometheus-compatible, sloppy code, spotty docs. I have no idea why this dumb product continues to attract users.

https://prometheus.io/blog/2021/05/04/prometheus-conformance-remote-write-compliance/

> Telegraf, which is to metrics sort of what Logstash is to logs: a swiss-army knife tool that adapts arbitrary inputs to arbitrary output formats. We run Telegraf agents on our nodes to scrape local Prometheus sources, and Vicky scrapes Telegraf. Telegraf simplifies the networking for our metrics; it means Vicky (and our iptables rules) only need to know about one Prometheus endpoint per node.

Normally you just use a regular Prometheus server to do this. Why add another, different technology to the stack?

> We spent some time scaling it with Thanos, and Thanos was a lot, as far as ops hassle goes.

It really isn't -- assuming you're not trying to bend Prometheus into something it isn't. Prometheus works using a federated, pull-based architecture. It expects to be near the things it's monitoring, and expects you to build out a hierarchy of infrastructure, in layers, to handle larger scopes.

This is structurally different to what I'll call the "clustering" model of scale, where you have all your data sources pushing their data, aggregating maybe on the machine or datacenter level, but then shuttling everything to a single central place, which you scale vertically from the perspective of your users. This appears to be what you want to do, based on the prevalence of push-based tech in your stack.

Prometheus doesn't work this way. Some people really want it to work this way, and have even created entire product lines that make it look as if it works this way (Cortex, M3db) but it's fundamentally just not how it's designed to be used. If you try to make it work this way yourself, you'll certainly get frustrated.

ryanschneider about 4 years ago

How big is the fly eng team at this point? You all seem to be doing a ton, I’m always kind of surprised these posts don’t end with the usual “we’re hiring” blurb that’s become the norm on these sorts of tech posts.

xfer about 4 years ago

Do you plan to join the Bandwidth Alliance at Cloudflare? Data prices are quite high. They're the same as GCP/AWS.

jzelinskie about 4 years ago

Hadn't heard of promxy before. In the past, to reduce cardinality/deduplicate metrics, I've just run another instance of Prometheus entirely in-memory and used rewrite rules.

Exposing a metrics endpoint for customers is nice. How do you manage the cardinality? I haven't used Victoria before, is it just better at high cardinality time series?
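
For readers unfamiliar with what "reducing cardinality" means here, the following is a minimal sketch of one common approach: dropping a high-cardinality label and merging the series that become identical. It is not the commenter's Prometheus setup and not how promxy or VictoriaMetrics work internally; the function name and sample data are hypothetical, and summing is only meaningful for additive values such as counters.

```python
from collections import defaultdict


def drop_label(samples, label):
    """Sum samples that become identical once `label` is removed.

    `samples` is an iterable of (name, labels_dict, value) tuples.
    Returns a dict keyed by (name, sorted_label_tuple).
    """
    merged = defaultdict(float)
    for name, labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        merged[(name, kept)] += value
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical per-instance request counters.
    samples = [
        ("http_requests_total", {"app": "web", "instance": "a1"}, 10.0),
        ("http_requests_total", {"app": "web", "instance": "b2"}, 5.0),
        ("http_requests_total", {"app": "api", "instance": "c3"}, 2.0),
    ]
    for key, value in drop_label(samples, "instance").items():
        print(key, value)
```

Three input series collapse to two here; with thousands of short-lived instances the reduction is correspondingly larger.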

Sytten about 4 years ago

They mentioned that they decided against Thanos for the storage of metrics, but I would be curious to hear if other TSDBs were considered. It is a hot space; I know about M3DB, ClickHouse, Timescale, Influx, QuestDB, OpenTSDB, etc.

twic about 4 years ago

> When it comes to automated monitoring, there are two big philosophies, “checks” and “metrics”.

There's a third, "events". Just push an event out whenever something interesting happens, and let the monitoring tool decide whether to count, aggregate, histogram, alert, etc.

Events require less code in the app (no storage, no aggregation, no web server), and allow more flexibility in processing. I have used events to great effect. I am baffled as to why monitoring people still only talk about metrics.
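
A minimal sketch of the event model described above, assuming JSON-encoded events and an in-memory list standing in for the transport. All names here (`emit`, `Collector`, the `duration_ms` field, the bucket bounds) are hypothetical. The point is that the app only emits, and the collector decides afterwards whether to count or build a histogram.

```python
import bisect
import json
from collections import Counter


def emit(sink, name, **fields):
    """App side: serialize one event and hand it to a sink."""
    sink.append(json.dumps({"event": name, **fields}))


class Collector:
    """Monitoring side: decides after the fact how to aggregate events."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)  # histogram bucket upper bounds
        self.counts = Counter()       # events seen, per event name
        self.buckets = Counter()      # duration_ms observations per bucket

    def ingest(self, line):
        event = json.loads(line)
        self.counts[event["event"]] += 1
        if "duration_ms" in event:
            # Find the first bucket whose upper bound is >= the value.
            i = bisect.bisect_left(self.bounds, event["duration_ms"])
            bound = self.bounds[i] if i < len(self.bounds) else float("inf")
            self.buckets[bound] += 1


if __name__ == "__main__":
    sink = []
    emit(sink, "request", path="/", duration_ms=7)
    emit(sink, "request", path="/x", duration_ms=250)
    emit(sink, "deploy", version="v2")

    collector = Collector([10, 100, 1000])
    for line in sink:
        collector.ingest(line)
    print(dict(collector.counts), dict(collector.buckets))
```

The app side has no storage, aggregation, or web server, which matches the comment's claim; the cost moves to the transport and the collector, which must handle every individual event.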

tantalor about 4 years ago

> If you’re an Advent of Code kind of person and you haven’t already written a parser for this format, you’re probably starting to feel a twitch in your left eyelid. Go ahead and write the parser; it’ll take you 15 minutes and the world can’t have too many implementations of this exposition format. There's a lesson about virality among programmers buried in here somewhere.

Huh? Who gets excited about writing a parser?

What was wrong with "${key} ${value}" on separate lines?
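
For context on what such a parser involves, here is a minimal sketch for a simplified subset of the Prometheus text exposition format: `name{label="value",...} value [timestamp]` lines, with `#` comment lines skipped. It is an illustration under those assumptions, not a conformant implementation, and the sample metrics are made up. The labels are the main thing a plain "${key} ${value}" format would lack.

```python
import re

# One sample line: name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# One label pair; the value may contain backslash escapes.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def unescape(raw):
    """Resolve the \\n, \\" and \\\\ escapes used in label values."""
    return re.sub(
        r'\\(.)',
        lambda m: '\n' if m.group(1) == 'n' else m.group(1),
        raw,
    )


def parse_exposition(text):
    """Return a list of (name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f'unparseable line: {line!r}')
        labels = {
            key: unescape(raw)
            for key, raw in LABEL_RE.findall(match.group('labels') or '')
        }
        samples.append((match.group('name'), labels, float(match.group('value'))))
    return samples


if __name__ == "__main__":
    text = (
        '# TYPE http_requests_total counter\n'
        'http_requests_total{method="post",code="200"} 1027 1395066363000\n'
        'up 1\n'
    )
    for sample in parse_exposition(text):
        print(sample)
```

The sketch ignores timestamps after matching them and does not interpret `# TYPE` metadata, both of which a fuller parser would keep.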

pm90 about 4 years ago

Who's the target audience for Fly?

I’m trying to understand the market they’re operating in. Big ol' enterprises would probably want to run on AWS/GCP, right? So would startups? What’s the long game? Genuine question.

vira28 about 4 years ago

I really like those diagrams. What tools or software were used to generate them?

yannoninator about 4 years ago

I'm not quite sure if Fly is production-ready for our use case yet, but it does look awesome.

jimmyed about 4 years ago

Thanos is a lot of ops, true. But did they try Cortex? Oh, and M3?

dexen about 4 years ago

Not to pick on Fly (seems nice), but on the trend for containers:

> if you’ve got a Docker container, it can be running on Fly in single-digit minutes.

I used to laugh at the old Plan 9 fortune, "... Forking an allegro process requires only seconds... -V. Kelly". Guess I'm not laughing anymore?

FWIW, performance of components is the barrier to composition in system design and development. You can't compose modules that take seconds to act, and still have something that is usable real-time.

0xbadcafebee about 4 years ago

> Fly.io transforms container images into fleets of micro-VMs running around the world on our hardware.

Oh boy!

> None of us have ever worked for Google, let alone as SREs. So we’re going out on a limb

Oh.... boy.

> We spent some time scaling it with Thanos, and Thanos was a lot, as far as ops hassle goes.

You know, they have these companies now, that will collect your metrics for you, so that you don't have to deal with ops hassle.

> In each Firecracker instance, we run our custom init,

... in Rust. Yes, the thing that is normally a shell script, is now a compiled program in a new language, that mostly just runs mkdir(), mount() and ethtool(). (https://github.com/superfly/init-snapshot/blob/public/src/bin/init/main.rs). In a few years, when that component is passed off to a dedicated Ops team, and they find it hard to hire a sysadmin who also knows Rust, there will be some poor intern who learned Rust over the summer whose job is to rewrite that thing back into a shell script.
