Congrats on the launch. Having worked at a startup in the AIOps space, I can offer a few suggestions.<p>1. No matter how good your AI, it will make mistakes. Users need ways to provide feedback or filter alerts to avoid alert fatigue. Giving users a sense of control is critical.<p>2. Most larger shops will have dozens of monitoring tools already generating alerts. Consider ingesting existing alerts as another algorithmic signal.<p>3. The real root cause of an incident often won't show up in logs. Don't assume that the earliest event in a cluster is causal.<p>4. The more context you can provide an operator looking at a potential incident, the better. Modern SIEM tools do an OK job here. Consider pulling in topology or other enrichment sources and matching entity names/IDs to log data.<p>Good luck. Contact in profile if you'd like to chat further.
Hey folks,<p>Larry, Ajay and Rod here!<p>We're excited to share Zebrium's autonomous incident detection software. Zebrium uses unsupervised machine learning to detect software incidents and show you root cause. It's built to catch even "Unknown Unknowns" (problems you don't have alert rules built for), the FIRST time you hit them. We believe autonomous incident detection is an important tool for defeating complexity and crushing resolution time.<p>*** Get Started ***<p>1) Go to our website and click "Get Started Free". Enter your name, email and set a password.
2) Install our collectors from a list of supported platforms. For K8s it's a one-command install. Join the newly created private Slack channel for alerts (or add a webhook for your own).
3) That's it. Automatic incident detection starts within an hour and quickly gets good. You can drill down into logs & metrics for more context if needed.<p>Getting started takes less than 2 minutes. It's free for 30 days with larger limits and then free forever for up to 500MB/day.<p>*** Here's what you WON'T have to do ***<p>Manual training, code changes, connectors, parsers, configuration, waiting, hunting, searching, alert rules, etc! It works with any app or stack.<p>*** How It Works ***<p>We structure all the logs and metrics we collect in-line at ingest, leverage this structure to find normal and anomalous patterns of activity, then use a point process model to identify unusually correlated anomalies (across different data streams) to auto-detect incidents and find the relevant root-cause indicators. Experience with over a thousand real-world incidents across over a hundred stacks has confirmed that software behaves in certain fundamental ways when it breaks.<p>It turns out that we can detect important incidents automatically, with a root-cause indicator when the logs and metrics reflect it. Zebrium works well enough that our own team relies on it, and we believe you'll want to use it, too.
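For readers wondering what "unusually correlated anomalies" could look like in practice, here is a rough illustrative sketch. It is not Zebrium's actual model, which the post does not describe in detail. It assumes each stream emits anomaly events as an independent Poisson process with a known baseline rate, buckets events into fixed windows, and flags a window when two or more streams are anomalous together and the joint tail probability is very small. The function names, the 60-second window, and the threshold are all made up for the example.

```python
import math
from collections import defaultdict


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def detect_incidents(events, baseline_rates, window=60.0, p_threshold=1e-4):
    """Flag time windows where several streams are anomalous together.

    events: list of (timestamp_seconds, stream_name) anomaly events.
    baseline_rates: assumed expected anomalies per second for each stream.
    Returns a list of (window_start, joint_p, streams) tuples.
    """
    # Count anomaly events per stream in each fixed-size window.
    buckets = defaultdict(lambda: defaultdict(int))
    for ts, stream in events:
        buckets[int(ts // window)][stream] += 1

    incidents = []
    for idx in sorted(buckets):
        counts = buckets[idx]
        if len(counts) < 2:
            continue  # only cross-stream coincidences are of interest here
        # Assumption: streams are independent under normal operation,
        # so the per-stream tail probabilities multiply.
        joint_p = 1.0
        for stream, k in counts.items():
            joint_p *= poisson_tail(k, baseline_rates[stream] * window)
        if joint_p < p_threshold:
            incidents.append((idx * window, joint_p, sorted(counts)))
    return incidents


if __name__ == "__main__":
    sample = [(10, "app_log"), (12, "app_log"), (15, "app_log"),
              (20, "cpu_metric"), (25, "cpu_metric"), (500, "app_log")]
    rates = {"app_log": 0.001, "cpu_metric": 0.001}
    for start, p, streams in detect_incidents(sample, rates):
        print(start, p, streams)
```

The lone `app_log` event at t=500 is ignored because no other stream is anomalous in that window. A real system would need learned baselines and something better than fixed windows, so treat this only as a way to picture the idea.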
One question:<p>Where is the systematic evidence that this product actually works? What are the typical false positive and false negative rates in a standard setup? Did you construct various failed environments and measure the quality of the reports? For this sort of thing I would expect a simulation of at least 10-20 failure environments with detailed false positive/false negative rate measurements. Right now you have a lot of cherry-picked examples without any sort of systematic setup (in particular, you don't seem to talk about false positives anywhere).
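To make the kind of measurement asked for above concrete, here is a minimal sketch of scoring a detector against deliberately injected failures. Everything in it is an assumption for illustration: the function name, the 300-second matching tolerance, and the rule that an alert counts as correct if it fires within that tolerance after an injected failure starts. Since true negatives are not well defined for alert streams, it reports the fraction of alerts that were false rather than a classical false positive rate.

```python
def score_detections(injected, detected, tolerance=300.0):
    """Score alerts against known injected failures.

    injected: start times (seconds) of deliberately injected failures.
    detected: times (seconds) at which the detector raised an alert.
    An alert matches a failure if it fires within `tolerance` seconds
    after that failure started (an assumed matching rule).
    """
    matched = set()
    false_positives = 0
    for alert in detected:
        hits = [i for i, start in enumerate(injected)
                if 0 <= alert - start <= tolerance]
        if hits:
            matched.update(hits)
        else:
            false_positives += 1

    false_negatives = len(injected) - len(matched)
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        # Share of injected failures that were never detected.
        "false_negative_rate": false_negatives / len(injected) if injected else 0.0,
        # Share of alerts that did not correspond to any injected failure.
        "false_alert_fraction": false_positives / len(detected) if detected else 0.0,
    }


if __name__ == "__main__":
    print(score_detections([1000, 5000, 9000], [1060, 3000, 9100, 9200]))
```

Running this across 10-20 injected failure scenarios per environment would give the per-environment numbers the comment is asking about.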
As a founder in the monitoring space, and now heading up the core monitoring team at Netflix, I had a chance to work with Zebrium and have to say the technology is impressive. Unlike other anomaly detection services, they've done a lot of work to get decent incidents without too much noise completely unsupervised - this is definitely the next generation of observability and Zebrium has a clear head start in this space!
Just ran through your intro video. If it really does what it says, this is a great product. I'll have my team test this tomorrow. Good luck on your launch.
This is a game changer. I've met the team and they've got something special here.<p>You can see one of their talks and a great discussion at a BayLISA.org meeting.<p><a href="https://www.youtube.com/watch?v=gNiWtoxJ9iM" rel="nofollow">https://www.youtube.com/watch?v=gNiWtoxJ9iM</a>
Very cool, would love something like this. Your video shows a fairly straightforward incident response that traditional tools would handle equally well. Can you describe a situation that Zebrium handles better than legacy tools? Perhaps a hypothetical unknown unknown.
Nice website, folks. The 2-minute intro video does a great job presenting the value-prop. It looks like the solution detects events with a high probability of being a problem automatically via ML. Can I define my own events using custom condition criteria as well?
Congrats on the launch, and good luck! Looks fascinating. Looking forward to the future release that <i>fixes</i> the incidents as well, and just notifies us afterwards as a courtesy. =)