
Show HN: Root Cause as a Service – Never dig through logs again

25 points by stochastimus, almost 3 years ago
Hey Folks – Larry, Ajay and Rod here!

We address the age-old, painful problem of digging through logs to find the root cause when a problem occurs. No one likes searching through logs, so we spent a few years analyzing hundreds of real-world incidents to understand how humans troubleshoot in logs. Then we built a solution that automatically finds the same root cause indicators a human would otherwise have had to search for manually. We call it Root Cause as a Service. RCaaS works with any app and does not require manual training or rules. Our foundational thoughts and more details can be found here: https://www.zebrium.com/blog/its-time-to-automate-the-observer.

Obviously, everyone is skeptical when they hear about RCaaS. We encourage you to try it yourself, but we also have a really strong validation point. One of our customers performed a study using 192 actual customer incidents from 4 different products and found that Zebrium correctly identified the root cause indicators in the logs in over 95% of the incidents – see https://www.zebrium.com/cisco-validation.

For those who are interested, this is actually our second Show HN post; our first was last June: https://news.ycombinator.com/item?id=23490609. The link in that post points to our current home page, but our initial comment was, "We're excited to share Zebrium's autonomous incident detection software". At the time, our focus was on a tool that used unsupervised ML to automatically detect any kind of new or unknown software incident. We had done a lot of customer testing and were achieving > 90% detection accuracy in catching almost any kind of problem. But what we underestimated is just how high the bar is for incident detection. If someone is going to hook you up to a pager, then even an occasional false positive is enough for a user to start cursing your product! And users quickly forget about the times when your product saved their bacon by catching problems they would otherwise have missed.

But late last year we had a huge aha moment! Most customers already have monitoring tools in place that are really good at detecting problems; what they don't have is an automated way to find the root cause. So we built some really elegant integrations for Datadog, New Relic, Elastic, Grafana, Dynatrace, AppDynamics and ScienceLogic (and more to come via our open APIs), so that when there's a problem you see details of the root cause directly on your monitoring dashboard. Here's a 2-minute demo of what it looks like: https://youtu.be/t83Egs5l8ok.

You're welcome to sign up for a free trial at https://www.zebrium.com and we'd love to hear your questions and feedback.
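To make the idea concrete, here is a minimal sketch (Python, standard library only) of the general approach described above: surface log templates that are rare in history but burst in the window around a failure. This is an illustration of the concept only, not Zebrium's actual algorithm (which is not public); all names and numbers in it are invented.

    import re
    from collections import Counter

    def template(line: str) -> str:
        """Reduce a raw log line to a coarse template by masking hex ids and numbers."""
        line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
        line = re.sub(r"\d+", "<NUM>", line)
        return line.strip()

    def root_cause_candidates(history, incident_window, top_n=5):
        """Rank templates that are frequent in the incident window but rare in history."""
        hist = Counter(template(l) for l in history)
        win = Counter(template(l) for l in incident_window)
        scored = [(freq / (1 + hist.get(tmpl, 0)), tmpl) for tmpl, freq in win.items()]
        return sorted(scored, reverse=True)[:top_n]

    # Invented example: the never-before-seen auth failure outranks the noisy 500s.
    history = ["GET /api/items 200 12ms"] * 1000 + ["cache refresh ok id=42"] * 200
    window = ["GET /api/items 500 3ms"] * 20 + \
             ["FATAL: password authentication failed for user app"] * 5
    for score, tmpl in root_cause_candidates(history, window):
        print(f"{score:6.1f}  {tmpl}")

The real product presumably does much more than this (sequence and correlation analysis across services, for instance), but the "rare events clustering around the failure" intuition is the part a human would otherwise grep for by hand.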

6 comments

coward123, almost 3 years ago
I give you credit for working in this space and trying to create a more automated approach... I spent many years in the app performance world, both as a consultant and working on products, so again - good on you.

For what it's worth, my immediate reaction is that you might work on different terminology in how you present what your product does. I get that you are trying to create a contrived example in order to demo the product and show value, and that can be a very difficult thing to do. That said, in my line of thinking, an HTTP 500 isn't actually the root cause; it's a symptom of the cause. The password being set incorrectly isn't the root cause either. The real root cause is something in the deployment pipeline, the configuration control, the change management, the architecture, etc., that got us to this point.

I guess I'm struggling here a bit too, because I think of how many times I would have been the manual version of this, where I would show information like this to a client's technical team and had to absolutely spoon-feed them on how to remedy it. I remember a team that was supposed to be crack guys from a vendor, an app team, etc., who had been working on a problem for months that I fixed in a matter of hours, because they just didn't understand what the line in the log meant. So it isn't clear to me how your product is actually creating better visibility + interpretation of the problem toward a solution.

In the ten or so years I did that kind of work, what really stood out to me was that the seemingly obvious tech issues were not obvious because of a lack of education / experience / training on the part of the client personnel, but more often than not the real problems were much, much larger architectural issues, way beyond just the message in the log. Those are much harder to both identify and correct, and products like yours and the ones you integrate with are almost just a band-aid on the problem.

So, take that for what it's worth - again, good work trying to improve the state of the art in this area.
kordlessagain, almost 3 years ago
Having worked on a machine learning time-series document search solution for the last 2 years, I know exactly why the cost of this is so high. Running logs through a model must be VERY expensive.

I had a good friend at Splunk who passed away a few years ago. He was working on something similar, well before we had decent models. His anomaly detection used differences in regular expression patterns to detect "strange things". I guess that's why he carried the title "Chief Mind".

I'm excited about where ML and time-series data are going. It's going to be interesting!
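A toy version of that regex-difference idea might look like the following (a guess at the approach, not the actual Splunk-era implementation): keep a set of patterns learned from normal traffic and flag any line that matches none of them. The patterns and log lines here are invented.

    import re

    # Patterns that "normal" traffic is known to match (assumed, for illustration).
    KNOWN_PATTERNS = [
        re.compile(r"^(GET|POST) /\S+ (2|3)\d\d \d+ms$"),
        re.compile(r"^cache refresh ok id=\d+$"),
        re.compile(r"^worker \d+ heartbeat$"),
    ]

    def strange_lines(lines):
        """Yield lines that match no known-good pattern: candidate 'strange things'."""
        for line in lines:
            if not any(p.match(line) for p in KNOWN_PATTERNS):
                yield line

    logs = [
        "GET /api/items 200 12ms",
        "worker 7 heartbeat",
        "FATAL: password authentication failed for user app",
    ]
    for line in strange_lines(logs):
        print("anomaly candidate:", line)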
treis, almost 3 years ago
> Here's a 2 minute demo of what it looks like: https://youtu.be/t83Egs5l8ok

The problem with this demo is that it uses something that's 100% broken due to something that happened immediately before the failure. That's not hard to debug, and I don't really see value there.

The scenarios that could use this sort of tool are things like someone turning on a flag that breaks 1% of a specific endpoint but only 0.1% of overall requests. So something sub-alert level, without an immediately obvious cause & effect. If you can detect something like that without generating a ton of noise and give a hint to the root cause, then that'd be something killer.

It's a cool idea and I can see the value. We've had scenarios like the one I mentioned (and worse) go undetected because of the noise Sentry generates. If you can solve that, then you've really got something.
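A rough illustration of that detection gap, with made-up numbers: a flag that breaks 1% of one endpoint barely moves the overall error rate, so a global alert threshold never fires, while a per-endpoint baseline notices immediately.

    requests = {
        # endpoint: (total requests, errors)  -- numbers are invented
        "/api/checkout": (10_000, 100),   # 1% failing after the bad flag flip
        "/api/items":    (90_000, 45),    # normal background error rate
    }

    total = sum(n for n, _ in requests.values())
    errors = sum(e for _, e in requests.values())
    print(f"overall error rate: {errors / total:.3%}")   # ~0.145%: below a typical global alert

    BASELINE = 0.001  # assumed historical per-endpoint error rate of 0.1%
    for endpoint, (n, e) in requests.items():
        rate = e / n
        if rate > 5 * BASELINE:  # per-endpoint threshold relative to its own baseline
            print(f"regression on {endpoint}: {rate:.2%} vs baseline {BASELINE:.2%}")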
randombits0, almost 3 years ago
If rnd() > .5 printf("it's DNS!")
throwaway81523, almost 3 years ago
95.8% of the time it's kind of obvious what happened, at least with reasonable monitoring. Digging through logs is for the other 4.2% of the time. Having done that kind of thing more than once, I don't see ML as being helpful. You often end up writing scripts to search for specific combinations of events that are only identifiable after the incident has happened.
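The kind of after-the-fact script being described might look like this (the log format, timestamps and event strings are all assumptions): find places where two specific events land within a short window of each other, a "combination" that only looks interesting once you already know how the incident unfolded.

    import re
    from datetime import datetime, timedelta

    EVENT_A = re.compile(r"connection pool exhausted")
    EVENT_B = re.compile(r"deadline exceeded")
    WINDOW = timedelta(seconds=30)

    def parse(line):
        # Assumes lines like "2022-06-17T10:32:05 message ..."; adjust to your format.
        ts, _, msg = line.partition(" ")
        return datetime.fromisoformat(ts), msg

    def cooccurrences(lines):
        """Yield B events that occur within WINDOW of a preceding A event."""
        recent_a = []
        for line in lines:
            ts, msg = parse(line)
            if EVENT_A.search(msg):
                recent_a.append(ts)
            if EVENT_B.search(msg):
                recent_a = [t for t in recent_a if ts - t <= WINDOW]
                if recent_a:
                    yield ts, msg

    logs = [
        "2022-06-17T10:32:05 connection pool exhausted on db-2",
        "2022-06-17T10:32:21 rpc to billing: deadline exceeded",
    ]
    for ts, msg in cooccurrences(logs):
        print(ts, msg.strip())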
Huntsecker, almost 3 years ago
Do you not take volume of logs into consideration for pricing, then?