Leveraging AI for efficient incident response

110 pointsby Amaresh9 months ago

16 comments

LASR9 months ago

We've shifted our oncall incident response over to mostly AI at this point. And it works quite well.One of the main reasons why this works well is because we feed the models our incident playbooks and response knowledge bases.These playbooks are very carefully written and maintained by people. The current generation of models are pretty much post-human in following them, performing reasoning and suggesting mitigations.We tried indexing just a bunch of incident slack channels and result was not great. But with explicit documentation, it works well.Kind of proves what we already know, garbage in, garbage out. But also, other functions, eg: PM, Design have tried automating their own workflows, but doesn't work as well.

评论 #41327207 未加载

评论 #41326951 未加载

评论 #41330347 未加载

评论 #41329585 未加载

评论 #41329239 未加载

评论 #41328849 未加载

评论 #41327931 未加载

donavanm9 months ago

Im really interested in the implied restriction/focus on “code changes.”IME a very very large number of impacting incidents arent strictly tied to “a” code change, if any at all. It _feels_ like theres an implied solution to tying running version back to deployment rev, to deployment artifacts, and vcs.Boundary conditions and state changes in the distributed system were the biggest bug bear I ran in to at AWS. Then below that were all of the “infra” style failures like network faults, latency, API quota exhaustion, etc. And for all the cloudformation/cdk/terraform in the world its non trivial to really discover those effects and tie them to a “code change.” Totally ignoring older tools that may be managed via CLI or the ol’ point and click.

评论 #41326450 未加载

评论 #41327182 未加载

评论 #41326414 未加载

pants29 months ago

> The biggest lever to achieving 42% accuracy was fine-tuning a Llama 2 (7B) model42% accuracy on a tiny, outdated model - surely it would improve significantly by fine-tuning Llama 3.1 405B!

评论 #41345915 未加载

nyellin9 months ago

We've open sourced something with similar goals that you can use today: <a href="https://github.com/robusta-dev/holmesgpt/">https://github.com/robusta-dev/holmesgpt/</a>We're taking a slightly different angle than what Facebook published, in that we're primarily using tool calling and observability data to run investigations.What we've released really shines at surfacing up relevant observability data automatically, and we're soon planning to add the change-tracking elements mentioned in the Facebook post.If anyone is curious, I did a webinar with PagerDuty on this recently.

评论 #41327434 未加载

评论 #41327409 未加载

mafribe9 months ago

The paper goes out of its way not to compare the 42% figure with anything. Is "42% within the top 5 suggestions" good or bad?How would an experienced engineer score on the same task?

TheBengaluruGuy9 months ago

Interesting. Just a few weeks back, I was reading about their previous work <a href="https://atscaleconference.com/the-evolution-of-aiops-at-meta-beyond-the-buzz/" rel="nofollow">https://atscaleconference.com/the-evolution-of-aiops-at-meta...</a> -- didn't realise there's more work!Also, some more researches in the similar space by other enterprises:Microsoft: <a href="https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pdf" rel="nofollow">https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pd...</a>Salesforce: <a href="https://blog.salesforceairesearch.com/pyrca/" rel="nofollow">https://blog.salesforceairesearch.com/pyrca/</a>Personal plug: I'm building a self-service AIOps platform for engineering teams (somewhat similar to this work by Meta). If you're looking to read more about it, visit -- <a href="https://docs.drdroid.io/docs/doctor-droid-aiops-platform">https://docs.drdroid.io/docs/doctor-droid-aiops-platform</a>

MOARDONGZPLZ9 months ago

I would love if they leveraged AI to detect AI on the regular Facebook feed. I visit occasionally and it’s just a wasteland of unbelievable AI content with tens of thousands of bot (I assume…) likes. Makes me sick to my stomach and I can’t even browse.

aray079 months ago

I do think AI will automate a lot of the grunt work involved with incidents and make the life of on-call engineers better.We are currently working on this at: <a href="https://github.com/opslane/opslane">https://github.com/opslane/opslane</a>We are starting by tackling adding enrichment to your alerts.

benreesman9 months ago

Way back in the day on FB Ads we trained a GBDT on a bunch of features extracted from the diff that had been (post-hoc) identified as the cause of a SEV.Unlike a modern LLM (or most any non-trivial NN), a GBDT’s feature importance is defensively rigorous.After floating the results to a few folks up the chain we burned it and forget where.

BurningFrog9 months ago

PSA:9 times out of 10, you can and should write "using" instead of "leveraging".

评论 #41329298 未加载

AeZ1E9 months ago

nice to see meta investing in AI investigation tools! but 42% accuracy doesn't sound too impressive to me... maybe there's still some fine-tuning needed for better results? glad to hear about the progress though!

评论 #41327197 未加载

ketzo9 months ago

This is really cool. My optimistic take on GenAI, at least with regard to software engineering, is that it seems like we're gonna have a lot of the boring / tedious parts of our jobs get a lot easier!

评论 #41326442 未加载

coding1239 months ago

AI 1: This user is suspicious, lock accountUser: Ahh, got locked out, contact support and waitAI 2: The user is not suspicious, unlock accountUser: Great, thank youAI 1: This account is suspicious, lock account

评论 #41328776 未加载

_pdp_9 months ago

I will be more interested to understand how they deal with injection attacks. Any alert where the attacker controls some parts of the text that ends up in the model could be used to either evade it worse use it to hack it. Slack had an issue like that recently.

devneelpatel9 months ago

This is exactly what we do at OneUptime.com. Show you AI generated possible Incident remediation based on your data + telemetry + code. All of this is 100% open-source.

minkles9 months ago

I'm going to point out the obvious problem here: 42% RC identification is shit.That means the first person on the call doing the triage has a 58% chance of being fed misinformation and bias which they have to distinguish from reality.Of course you can't say anything about an ML model being bad that you are promoting for your business.

评论 #41326765 未加载