Launch HN: Opstrace (YC S19) – open-source Datadog

316 points by spahl over 4 years ago

Hi HN!

Seb here, with my co-founder Mat. We are building an open-source observability platform aimed at the end user. We assemble what we consider the best open source APIs and interfaces, such as Prometheus and Grafana, but make them as easy to use and featureful as Datadog, with, for example, TLS and authentication by default. It's scalable (horizontally and vertically) and upgradable without a team of experts. Check it out here: http://opstrace.com/ & https://github.com/opstrace/opstrace

About us: I co-founded dotCloud, which became Docker, and was also an early employee at Cloudflare, where I built their monitoring system back when there was no Prometheus (I had to use OpenTSDB :-). I have since been told it's all been replaced with modern stuff, thankfully! Mat and I met at Mesosphere where, after building DC/OS, we led the teams that would eventually transition the company to Kubernetes.

In 2019, I was at Red Hat and Mat was still at Mesosphere. A few months after IBM announced its acquisition of Red Hat, Mat and I started brainstorming problems that we could solve in the infrastructure space. We started interviewing a lot of companies, always asking them the same questions: "How do you build and test your code? How do you deploy? What technologies do you use? How do you monitor your system? Logs? Outages?" A clear set of common problems emerged.

Companies that used external vendors (such as CloudWatch, Datadog, SignalFx) grew to a certain size where cost became unpredictable and wildly excessive. As a result (one of many downsides we would come to uncover) they monitored less (i.e. just error logs, no real metrics/logs in staging/dev, and turning metrics off in prod to reduce cost).

Companies going the opposite route (choosing to build in-house with open source software) had different problems.
Building their stack took time away from their product development, and resulted in poorly maintained, complicated messes. Those companies are usually tempted to go to SaaS, but at their scale the cost is often prohibitive.

It seemed crazy to us that we are still stuck in this world where we have to choose between these two paths. As infrastructure engineers, we take pride in building good software for other engineers. So we started Opstrace to fix it.

Opstrace started with a few core principles: (1) The customer should always own their data; Opstrace runs entirely in your cloud account and your data never leaves your network. (2) We don’t want to be a storage vendor; that is, we won’t bill customers by data volume because this creates the wrong incentives for us. (AWS and GCP are already pretty good at storage.) (3) Transparency and predictability of costs: you pay your cloud provider for the storage/network/compute for running Opstrace and can take advantage of any credits/discounts you negotiate with them. We are incentivized to help you understand exactly where you are spending money because you pay us for the value you get from our product with per-user pricing. (For more about costs, see our recent blog post here: https://opstrace.com/blog/pulling-cost-curtain-back). (4) It should be REAL Open Source with the Apache License, Version 2.0.

To get started, you install Opstrace into your AWS or GCP account with one command: `opstrace create`. This installs Opstrace in your account, creates a domain name and sets up authentication for you for free. Once logged in you can create tenants that each contain APIs for Prometheus, Fluentd/Loki and more. Each tenant has a Grafana instance you can use. A tenant can be used to logically separate domains, for example prod, test, staging, or teams.
Whatever you prefer.

At the heart of Opstrace is a Cortex (https://github.com/cortexproject/cortex) cluster to provide the above-mentioned scalable Prometheus API, and a Loki (https://github.com/grafana/loki) cluster for the logs. We front those with authenticated endpoints (all public in our repo). All the data ends up stored only in S3 thanks to the amazing work of the developers on those projects.

An "open source Datadog" requires more than just metrics and logs. We are actively working on a new UI for managing, querying and visualizing your data, and many more features, like automatic ingestion of logs/metrics from cloud services (CloudWatch/Stackdriver), Datadog-compatible API endpoints to ease migrations and side-by-side comparisons, and synthetics (e.g. Pingdom). You can follow along on our public roadmap: https://opstrace.com/docs/references/roadmap.

We will always be open source, and we make money by charging a per-user subscription for our commercial version, which will contain fine-grained authz, bring-your-own OIDC and custom domains.

Check out our repo (https://github.com/opstrace/opstrace) and give it a spin (https://opstrace.com/docs/quickstart).

We’d love to hear your perspective. What are your experiences related to the problems discussed here? Are you all happy with the tools you’re using today?
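The post describes each tenant exposing authenticated, Prometheus- and Loki-compatible APIs in front of Cortex and Loki. As a rough sketch of what a client of such an endpoint sends, the Python below builds a request for Loki's standard JSON push API (`POST /loki/api/v1/push`). The tenant hostname `loki.prod.example.opstrace.io`, the token value, and the bearer-token scheme are assumptions for illustration, not Opstrace's documented interface; the request is built but never sent.

```python
import json
import time
import urllib.request
from typing import Dict, List, Optional


def build_loki_push(labels: Dict[str, str], lines: List[str],
                    ts_ns: Optional[int] = None) -> dict:
    """Build a body for Loki's JSON push API: one stream holding `lines`."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    # Each value is [unix-epoch nanoseconds as a string, log line]; adding i
    # keeps timestamps strictly increasing within the stream.
    values = [[str(ts_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


def build_push_request(base_url: str, token: str,
                       payload: dict) -> urllib.request.Request:
    """Prepare, but do not send, a POST to <base_url>/loki/api/v1/push.

    ASSUMPTION: the tenant endpoint accepts a bearer token; adjust the
    header to whatever authentication your deployment actually uses.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )


# Hypothetical tenant URL and token, for illustration only.
payload = build_loki_push({"job": "api", "env": "prod"}, ["started", "ready"],
                          ts_ns=1_600_000_000_000_000_000)
req = build_push_request("https://loki.prod.example.opstrace.io/",
                         "TENANT_TOKEN", payload)
# urllib.request.urlopen(req) would actually send it.
```

In practice a log shipper such as Fluentd or Promtail would produce this body for you; the point is only that every request carries tenant credentials and a label set identifying the stream.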

29 comments

brodouevencode over 4 years ago

We use [insert very large application performance monitoring tool here] for workloads running in [insert very, very large cloud provider here] and after examining our deployments, concluded that we were spending nearly $13k/mo for data transfer out expenditures because the monitoring agents have crazy aggressive defaults. Seems like running our own (which may be worthwhile) would alleviate anything like that.
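The arithmetic behind a bill like that is linear in per-host agent throughput. A back-of-the-envelope Python sketch follows; the 50 KB/s per host, 300 hosts, 30-day month and $0.09/GB egress rate are assumed inputs, not figures from the comment, and real cloud egress pricing is tiered.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def monthly_egress_gb(bytes_per_sec_per_host: float, hosts: int) -> float:
    """Decimal gigabytes sent off-cloud per month by all agents combined."""
    return bytes_per_sec_per_host * hosts * SECONDS_PER_MONTH / 1e9


def monthly_egress_cost(bytes_per_sec_per_host: float, hosts: int,
                        usd_per_gb: float) -> float:
    """Flat-rate cost model: ignores tiers, compression and free allowances."""
    return monthly_egress_gb(bytes_per_sec_per_host, hosts) * usd_per_gb


# Assumed inputs, for illustration only.
print(f"{monthly_egress_gb(50_000, 300):,.0f} GB/month")
print(f"${monthly_egress_cost(50_000, 300, 0.09):,.2f}/month")
```

With those assumed inputs the agents send about 38,880 GB a month, roughly $3,500. Because the model is linear, halving an agent's payload size or send frequency halves the bill, which is why aggressive agent defaults show up so directly on the invoice.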

tailspin2019 over 4 years ago

Nicely designed site, great logo, but after clicking around a bit (and looking at GitHub) I’m confused by what this product actually is.

DataDog has a UI. Does Opstrace? Or is it just a CLI/API based tool?

If you actually have a UI element to your product you’re doing a huge disservice to yourself by not actually showing this anywhere...

EDIT: I don’t mean to sound negative, I’m wondering if positioning this against Datadog is going to create immediate, potentially incorrect, expectations in people’s minds as to what this product might provide.

From first impressions I’d say this is much closer to Prometheus (which does have a UI but it’s so basic it may as well not - but then the UI is not the point of Prometheus).

sciurus over 4 years ago

It looks like you're largely selling a fancy installer for software primarily developed by another company, Grafana Labs. They offer open source, hosted SaaS, and paid-for "enterprise" versions of their software.

Why should someone choose Opstrace over purchasing from them directly?

dudeinjapan over 4 years ago

Hi there, at TableCheck (www.tablecheck.com) we recently adopted Lightstep.

In a nutshell, running all these various components (Grafana, etc) is a royal pain in the neck. Even if `opstrace create` spawns them easily, the problem is running/maintaining them. We want someone to run these for us as a SaaS/PaaS and we're happy to pay them.

Re: your principles:

(1) The customer should always own their data --> we agree. However, we are happy for you to be a custodian of that data.

(2) We don’t want to be a storage vendor --> neither do we. We want storage to be someone else's problem. We're happy for you to use a cloud platform like AWS/GCP and charge us a 50% markup.

(3/4) Transparency, predictability of costs, open source --> all excellent.

jarym over 4 years ago

Very exciting! Question: your homepage says it’ll always be Apache 2 but what will you do if someone like AWS rebrands your work (looking over at Elastic here)?

hangonhn over 4 years ago

Damn. That's one hell of a set of credentials for the founders.

I was the engineer who was heavily involved with monitoring at my last job and a lot of what this is doing aligns with what I would have done myself. At my new job, I work on different stuff but I can see we're going to run into monitoring issues soon too. I'm so, so, so glad this is an option because I do not want to rebuild that stuff all over again. Getting monitoring scalable and robust is HARD!

boundlessdreamz over 4 years ago

1. It would be great if you can integrate with https://vector.dev/. It also saves you the effort of integrating with many sources.
2. When opstrace is set up in AWS/GCP, what is the typical fixed cost?

stevemcghee over 4 years ago

FWIW, I was able to play with a preview and found it straightforward to set up and it kinda just did what I expected. I'm happy to see them taking next steps here. Good luck opstrace!

tamasnet over 4 years ago

This looks very promising, thank you and congrats!

Also, please don't forget about people (like me) who don't run on $MAJOR_CLOUD_PROVIDER. I'd be curious to try this e.g. on self-operated Docker w/ Minio.

sneak over 4 years ago

> We will always be open source, and we make money by charging a per-user subscription for our commercial version which will contain fine-grained authz, bring-your-own OIDC and custom domains.

Seems to me that these are at odds. If you're open source, why does anyone have to pay for these things?

If you're open core, I think it's mighty misleading to say things like "We will always be open source" because then not only is it untrue on its face, but also if someone contributes useful features to the open source project that compete with or supplant your paid proprietary bits, you are incentivized to refuse to merge their work - extremely not in the spirit of open source.

My perspective, which you asked for, is that open core is dishonest, and that you should be honest with yourselves about being a proprietary software vendor if that's indeed your plan, and stop with the open source posturing.

If I've misunderstood you, then I apologize.

zaczekadam over 4 years ago

Hey, I think this might be the coolest product intro I've read.

My two points - right now docs are clearly targeting users familiar with the competition but for someone like me who does not know similar products, a 'how it works' section with examples would be awesome.

Fingers crossed!

arianvanp over 4 years ago

Your mascot is almost exactly identical to ScyllaDB's mascot (https://scylladb.com/). Is there any connection, or is it a happy accident?

mrwnmonm over 4 years ago

Man, I was hoping someone would do this. Thanks very much. Please please please, care about the design. I don't know why open source projects always have bad design.

Wish you all the best, and congratulations!

snissn over 4 years ago

hi! Some quick perspective - my thoughts looking into this are "ok cool what metrics do i get for free? cpu load? disk usage? the hard to find memory usage?" and i just get lost in your home page without any examples of what the dashboard looks like

tmzt over 4 years ago

You mentioned Loki in your post. I evaluated it for our company and was reasonably impressed with the simplicity of setup and efficient storage. Where it failed us was the difficulty searching by customer identifiers or other "high cardinality" labels, or full-text. There's a longstanding issue [1] on the Github for this. Are you doing anything to improve log search versus an Elasticsearch cluster, for instance?

More broadly, how are you contributing to the upstream projects?

[1] https://github.com/grafana/loki/issues?page=7&q=is%3Aissue+is%3Aopen
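The high-cardinality pain follows from Loki's data model: each distinct combination of label values becomes a separate stream, and only labels are indexed, not log contents. The toy Python model below (an illustration of that rule, not Loki code) counts streams for synthetic entries; the label names and the 1,000 customers are made up.

```python
from typing import Dict, List


def count_streams(entries: List[Dict[str, str]], label_keys: List[str]) -> int:
    """Number of distinct label sets, i.e. streams, when only `label_keys`
    are promoted to labels and every other field stays in the log line."""
    keys = sorted(label_keys)
    return len({tuple((k, e[k]) for k in keys) for e in entries})


# Synthetic entries: 2 services x 3 levels x 1,000 customers.
entries = [
    {"service": svc, "level": lvl, "customer_id": f"c{i}"}
    for svc in ("api", "worker")
    for lvl in ("info", "warn", "error")
    for i in range(1000)
]

print(count_streams(entries, ["service", "level"]))                 # 6
print(count_streams(entries, ["service", "level", "customer_id"]))  # 6000
```

With only `service` and `level` as labels, finding one customer means a brute-force line filter such as `{service="api"} |= "c42"` over every matching chunk, which is the full-text search cost described above; promoting `customer_id` to a label makes the lookup an index hit but creates a thousand times more streams.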

rubiquity over 4 years ago

Nice. I've talked myself out of starting a monitoring product at least a few dozen times. As you point out, customers either get to choose between being gouged or run their own spaghetti.

On top of bad UX, I do think the storage layer is where customers are really getting hit by these companies. The big players are using very unoptimized ingestion and querying layers and pretending like tiered storage never happened. Developers share some of the blame too by not being at all pragmatic about how long and how much to keep. It's a tough nut to crack.

What's the plan for commercial? They run it themselves and pay per user? If so, that's refreshing.

GeneralTspoon over 4 years ago

This looks super cool!

We just moved away from Datadog because their log storage pricing is too high for us. We moved to BigQuery instead. But the interface kind of sucks.

Would love to get this up and running. A couple of questions:

1. Is it possible to set it up outside of AWS/GCP? I would like to set this up on a dedicated server.
2. If not - then do you have a pricing comparison page where you give some example figures? e.g. to ingest 1 billion log lines from Apache per month it will cost you roughly $X in AWS hosting fees and $Y per seat to use Opstrace
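The storage half of the second question is simple arithmetic. As a rough sketch, assuming an average Apache log line of 200 bytes, 10x compression in the log store's chunks, three months of retention, and S3 Standard at $0.023 per GB-month (all assumptions, excluding compute, requests, egress and per-seat fees), 1 billion lines a month works out as below.

```python
def log_storage_estimate(lines_per_month: float, avg_line_bytes: float,
                         compression_ratio: float, retention_months: float,
                         usd_per_gb_month: float) -> dict:
    """Object-storage cost only: excludes compute, requests, egress and seats."""
    raw_gb = lines_per_month * avg_line_bytes / 1e9  # decimal GB ingested per month
    stored_gb = raw_gb / compression_ratio            # compressed GB written per month
    # Once retention is reached, the bucket holds `retention_months` of data.
    steady_state_gb = stored_gb * retention_months
    return {
        "raw_gb_per_month": raw_gb,
        "steady_state_gb": steady_state_gb,
        "usd_per_month": steady_state_gb * usd_per_gb_month,
    }


# Assumed inputs: 200-byte lines, 10x compression, 3 months, $0.023/GB-month.
estimate = log_storage_estimate(1e9, 200, 10, 3, 0.023)
for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the bucket holds about 60 GB at steady state and costs on the order of a dollar or two a month, so the $X hosting figure would likely be dominated by the compute that ingests and queries the logs rather than by object storage itself.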

mleonhard over 4 years ago

> opstrace create -c CONFIG_FILE_PATH PROVIDER CLUSTER_NAME
> opstrace destroy PROVIDER CLUSTER_NAME
> opstrace list PROVIDER

I want to keep cluster config in source control, track deployment changes in code reviews, and automate deployments. Do you have any plans to add an 'apply' command to support this?

$ opstrace apply -c CONFIG_FILE_PATH [--dry-run] PROVIDER CLUSTER_NAME
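An `apply` command like the one proposed would typically diff the desired config from source control against the running cluster's config and, with `--dry-run`, only print the plan. The Python below sketches that diff step on plain dictionaries; the config keys (`node_count`, `tenants`, `log_retention_days`, `metric_retention_days`) are hypothetical and not taken from Opstrace's actual config schema.

```python
from typing import Any, Dict, List, Tuple

# (action, key, current value, desired value)
Change = Tuple[str, str, Any, Any]


def plan(current: Dict[str, Any], desired: Dict[str, Any]) -> List[Change]:
    """Top-level changes needed to move the running config to the desired one."""
    changes: List[Change] = []
    for key in sorted(current.keys() | desired.keys()):
        if key not in current:
            changes.append(("add", key, None, desired[key]))
        elif key not in desired:
            changes.append(("remove", key, current[key], None))
        elif current[key] != desired[key]:
            changes.append(("change", key, current[key], desired[key]))
    return changes


def apply(current: Dict[str, Any], desired: Dict[str, Any],
          dry_run: bool = True) -> List[Change]:
    changes = plan(current, desired)
    for action, key, old, new in changes:
        print(f"{action:>6} {key}: {old!r} -> {new!r}")
    if not dry_run and changes:
        # Hypothetical: this is where cluster/cloud APIs would be called.
        raise NotImplementedError("only --dry-run is sketched here")
    return changes


# Hypothetical config keys, not Opstrace's real schema.
running = {"node_count": 3, "tenants": ["prod", "staging"], "log_retention_days": 7}
in_git = {"node_count": 5, "tenants": ["prod", "staging", "dev"],
          "log_retention_days": 7, "metric_retention_days": 30}
apply(running, in_git, dry_run=True)
```

An empty plan means the cluster already matches the file, which is what makes running the same command from CI on every merge safe to repeat.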

richardw over 4 years ago

Point around incentives: We use Dynatrace. I’m sure it’s an eye-watering price but I do like that everyone who wants a license can get one. I don’t have to consider costs to add an entire dev team and teach them how to use it. It also means an entire dev team knows how to use it for future jobs.

polskibus over 4 years ago

What are your plans on supporting open telemetry? Can I send open telemetry data to opstrace?

rtkaratekid over 4 years ago

Looking through the docs I'm seeing there will be (at some point) an API. Does this include ways to integrate data coming from non-Opstrace sources? My specific case is an in-house monitor that basically just generates data.
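For an in-house monitor that just generates data, one common route into a Prometheus-compatible system is to expose the data in the Prometheus text exposition format, let a Prometheus server or agent scrape it, and forward it with remote write. A minimal Python sketch that renders gauge samples in that format follows; the metric name, HELP text and labels are invented for illustration.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value)


def escape_label_value(value: str) -> str:
    # The text format requires escaping backslash, double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples: List[Sample]) -> str:
    # Assumes help_text contains no backslashes or newlines.
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{escape_label_value(v)}"'
                             for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_gauge(
    "inhouse_queue_depth",
    "Items waiting in the in-house work queue.",
    [({"queue": "emails"}, 12), ({"queue": "exports"}, 3.5)],
)
print(text, end="")
```

Serving this string over HTTP (for example at `/metrics`) is enough for a scraper to pick it up; whether Opstrace would also accept other ingestion formats directly is the open question in the comment.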

ogazitt over 4 years ago

Congrats on launching - looks awesome! It's about time we have an open source datadog :)

Also, it's great to see the early focus on developer experience - "opstrace create".

opsunit over 4 years ago

Why should I run this instead of renewing my Wavefront contract?

thow_away_4242 over 4 years ago

Meanwhile, AWS getting the "AWS Opstrace Service" branding and marketing pages ready.

alexhutcheson over 4 years ago

One pain point with Prometheus is that it has relatively weak support for quantiles, histograms, and sets[1]:

- Histograms require manually specifying the distribution of your data, which is time-consuming, lossy, and can introduce significant error bands around your quantile estimates.
- Quantiles calculated via the Prometheus "summary" feature are specific to a given host, and not aggregatable, which is almost never what you want (you normally want to see e.g. the 95th percentile value of request latency for all servers of a given type, or all servers within a region). Quantiles can be calculated from histograms instead, but that requires a well-specified histogram and can be expensive at query time.
- As far as I know, Prometheus doesn't have any explicit support for unique sets. You can compute this at query time, but persisting and then querying high-cardinality data in this way is expensive.

Understanding the distribution of your data (rather than just averages) is arguably the most important feature you want from a monitoring dashboard, so the weak support for quantiles is very limiting.

Veneur[2] addresses these use-cases for applications that use DogStatsD[3] by using clever data structures for approximate histograms[4] and approximate sets[5], but I believe its integration with Prometheus is limited and currently only one-way - there is a CLI app to poll Prometheus metrics and push them into Veneur[6], but there's no output sink for Veneur to write to Prometheus (or expose metrics for a Prometheus instance to poll), and you aren't able to use the approximate histogram or approximate set datatypes if you go that route, because they can't be expressed as Prometheus metrics.

It would be extremely useful to have something similar for Prometheus, either by integrating with Veneur or implementing those data structures as an extension to Prometheus.

[1] https://prometheus.io/docs/practices/histograms/
[2] https://github.com/stripe/veneur
[3] https://docs.datadoghq.com/developers/dogstatsd/
[4] https://github.com/stripe/veneur#approximate-histograms
[5] https://github.com/stripe/veneur#approximate-sets
[6] https://github.com/stripe/veneur/tree/master/cmd/veneur-prometheus/
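The trade-off in the first two bullets is easier to see in code. Below is a simplified Python re-implementation of the linear interpolation that PromQL's `histogram_quantile()` applies to cumulative `le` buckets, plus the reason histograms aggregate where summaries do not: bucket counts from several hosts can be summed before estimating a quantile, whereas averaging per-host p95 values has no such meaning. It omits edge cases the real function handles, and the bucket bounds and counts are invented.

```python
import bisect
import math
from typing import Dict

Buckets = Dict[float, float]  # upper bound ("le") -> cumulative count


def merge(*hosts: Buckets) -> Buckets:
    """Sum cumulative bucket counts across hosts; bounds must be identical."""
    total: Buckets = {}
    for buckets in hosts:
        for le, count in buckets.items():
            total[le] = total.get(le, 0.0) + count
    return total


def histogram_quantile(q: float, buckets: Buckets) -> float:
    """Simplified version of PromQL's linear interpolation within a bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)
    if len(bounds) < 2 or bounds[-1] != math.inf:
        raise ValueError("need at least one finite bucket plus +Inf")
    counts = [buckets[le] for le in bounds]
    if counts[-1] == 0:
        return math.nan
    rank = q * counts[-1]
    b = bisect.bisect_left(counts, rank)  # first bucket with count >= rank
    if b == len(bounds) - 1:
        return bounds[-2]  # rank falls in the +Inf bucket: best we can say
    lower = bounds[b - 1] if b > 0 else 0.0
    below = counts[b - 1] if b > 0 else 0.0
    # Assume observations are spread uniformly inside the bucket.
    return lower + (bounds[b] - lower) * (rank - below) / (counts[b] - below)


# Invented latency buckets (seconds): a quiet, slow host and a busy, fast one.
host_a = {0.1: 10, 0.25: 30, 0.5: 60, 1.0: 100, math.inf: 100}
host_b = {0.1: 800, 0.25: 880, 0.5: 890, 1.0: 899, math.inf: 900}

p95_a = histogram_quantile(0.95, host_a)
p95_b = histogram_quantile(0.95, host_b)
print("mean of per-host p95s:", (p95_a + p95_b) / 2)
print("p95 of summed buckets:", histogram_quantile(0.95, merge(host_a, host_b)))
```

With these invented numbers the mean of the per-host p95s is about 0.57 s while the p95 of the summed buckets is 0.5 s, because averaging gives the quiet host as much weight as the busy one. Both figures still inherit the interpolation error the comment mentions, since the true values inside each bucket are unknown.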

tobilg over 4 years ago

Great job, congratulations from an ex-Mesosphere colleague!

nickstinemates over 4 years ago

Congrats! This is really exciting

rockyluke over 4 years ago

Congratulations! You did a really great job.

NSMyself over 4 years ago

Looking good, congrats on launching