We've shifted our oncall incident response over to mostly AI at this point. And it works quite well.<p>One of the main reasons why this works well is because we feed the models our incident playbooks and response knowledge bases.<p>These playbooks are very carefully written and maintained by people. The current generation of models are pretty much post-human in following them, performing reasoning and suggesting mitigations.<p>We tried indexing just a bunch of incident slack channels and result was not great. But with explicit documentation, it works well.<p>Kind of proves what we already know, garbage in, garbage out. But also, other functions, eg: PM, Design have tried automating their own workflows, but doesn't work as well.
Im really interested in the implied restriction/focus on “code changes.”<p>IME a very very large number of impacting incidents arent strictly tied to “a” code change, if any at all. It _feels_ like theres an implied solution to tying running version back to deployment rev, to deployment artifacts, and vcs.<p>Boundary conditions and state changes in the distributed system were the biggest bug bear I ran in to at AWS. Then below that were all of the “infra” style failures like network faults, latency, API quota exhaustion, etc. And for all the cloudformation/cdk/terraform in the world its non trivial to really discover those effects and tie them to a “code change.” Totally ignoring older tools that may be managed via CLI or the ol’ point and click.
> The biggest lever to achieving 42% accuracy was fine-tuning a Llama 2 (7B) model<p>42% accuracy on a tiny, outdated model - surely it would improve significantly by fine-tuning Llama 3.1 405B!
We've open sourced something with similar goals that you can use today: <a href="https://github.com/robusta-dev/holmesgpt/">https://github.com/robusta-dev/holmesgpt/</a><p>We're taking a slightly different angle than what Facebook published, in that we're primarily using tool calling and observability data to run investigations.<p>What we've released really shines at surfacing up relevant observability data automatically, and we're soon planning to add the change-tracking elements mentioned in the Facebook post.<p>If anyone is curious, I did a webinar with PagerDuty on this recently.
The paper goes out of its way <i>not</i> to compare the 42% figure with anything. Is <i>"42% within the top 5 suggestions"</i> good or bad?<p>How would an experienced engineer score on the same task?
Interesting. Just a few weeks back, I was reading about their previous work <a href="https://atscaleconference.com/the-evolution-of-aiops-at-meta-beyond-the-buzz/" rel="nofollow">https://atscaleconference.com/the-evolution-of-aiops-at-meta...</a> -- didn't realise there's more work!<p>Also, some more researches in the similar space by other enterprises:<p>Microsoft: <a href="https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pdf" rel="nofollow">https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pd...</a><p>Salesforce: <a href="https://blog.salesforceairesearch.com/pyrca/" rel="nofollow">https://blog.salesforceairesearch.com/pyrca/</a><p>Personal plug: I'm building a self-service AIOps platform for engineering teams (somewhat similar to this work by Meta). If you're looking to read more about it, visit -- <a href="https://docs.drdroid.io/docs/doctor-droid-aiops-platform">https://docs.drdroid.io/docs/doctor-droid-aiops-platform</a>
I would love if they leveraged AI to detect AI on the regular Facebook feed. I visit occasionally and it’s just a wasteland of unbelievable AI content with tens of thousands of bot (I assume…) likes. Makes me sick to my stomach and I can’t even browse.
I do think AI will automate a lot of the grunt work involved with incidents and make the life of on-call engineers better.<p>We are currently working on this at: <a href="https://github.com/opslane/opslane">https://github.com/opslane/opslane</a><p>We are starting by tackling adding enrichment to your alerts.
Way back in the day on FB Ads we trained a GBDT on a bunch of features extracted from the diff that had been (post-hoc) identified as the cause of a SEV.<p>Unlike a modern LLM (or most any non-trivial NN), a GBDT’s feature importance is defensively rigorous.<p>After floating the results to a few folks up the chain we burned it and forget where.
nice to see meta investing in AI investigation tools! but 42% accuracy doesn't sound too impressive to me... maybe there's still some fine-tuning needed for better results? glad to hear about the progress though!
This is really cool. My optimistic take on GenAI, at least with regard to software engineering, is that it seems like we're gonna have a lot of the boring / tedious parts of our jobs get a lot easier!
AI 1: This user is suspicious, lock account<p>User: Ahh, got locked out, contact support and wait<p>AI 2: The user is not suspicious, unlock account<p>User: Great, thank you<p>AI 1: This account is suspicious, lock account
I will be more interested to understand how they deal with injection attacks. Any alert where the attacker controls some parts of the text that ends up in the model could be used to either evade it worse use it to hack it. Slack had an issue like that recently.
This is exactly what we do at OneUptime.com. Show you AI generated possible Incident remediation based on your data + telemetry + code. All of this is 100% open-source.
I'm going to point out the obvious problem here: 42% RC identification is shit.<p>That means the first person on the call doing the triage has a 58% chance of being fed misinformation and bias which they have to distinguish from reality.<p>Of course you can't say anything about an ML model being bad that you are promoting for your business.