https://www.zebrium.com/blog/how-cisco-uses-zebrium-ml-to-analyze-logs-for-root-cause<p>Disclosure: I’m one of the co-founders behind the ML mentioned here. However, the results in the link above were generated independently by Cisco.<p>There have been a number of write-ups and papers evaluating ML for logs over the last few years, by academics, engineers, and observability vendors. But these results have typically been limited to synthetic (lab-generated) data sets, or to a handful of real-world data sets.<p>The Automation and Innovation team at Cisco just published test results from a study applying ML to a large corpus of actual software incidents. The goal was to test how well the ML picks out root cause indicators from logs spanning the duration of each problem.<p>Each problem’s log data set was run through the Zebrium ML engine, and was also independently root-caused by an engineer with expertise in that software. To be counted as a success, the summary ML output had to contain the same log event(s) that the engineer picked out during the manual analysis. Further, each report had to be concise and clear enough that the evaluating engineer found it to be a net advantage (compared with their familiar workflow of searching through logs).<p>The underlying software products were diverse in nature: Webex client (collaboration), DNA Center (networking), ISE (security), and UCS (infrastructure).<p>Each data set was a few GB in size (tens of millions of log events), spanned anywhere from a dozen to hundreds of log types (hence the challenge of correlating across logs), and typically contained a few days of logs. An engineer troubleshooting a software problem typically uses this history to understand the baseline of normal events, as well as to spot any outlier events that might be clues to the root cause.
The unsupervised ML was not pretrained: it trained only on the same history that was available in each log data set.<p>Given the sensitivity of the data sets, the raw data is not published, but more details are expected in a follow-up white paper. In any case, the results showed an accuracy rate of 95.8% for the ML, measured against the success criteria described above.
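<p>For anyone who wants to picture the scoring, here is a rough sketch of how the two success criteria could be checked per incident and rolled up into an accuracy rate. This is my own illustration, not Cisco's evaluation harness: the `Incident` type, its field names, and the sample event IDs are all hypothetical, and the "net advantage" judgment is reduced to a boolean supplied by the evaluating engineer.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Incident:
    """One evaluated incident (hypothetical structure, for illustration only)."""
    ml_events: Set[str]        # log event IDs in the ML summary report
    engineer_events: Set[str]  # log event IDs the engineer picked as root cause
    report_useful: bool        # engineer judged the report a net advantage


def is_success(incident: Incident) -> bool:
    """Success requires both criteria from the post:
    1. the ML summary contains every event the engineer identified, and
    2. the engineer found the report concise and clear enough to help.
    """
    if not incident.engineer_events:
        # Assumption: an incident with no manual root cause cannot be scored.
        return False
    return (incident.engineer_events <= incident.ml_events
            and incident.report_useful)


def accuracy(incidents: List[Incident]) -> float:
    """Fraction of incidents counted as a success."""
    if not incidents:
        raise ValueError("no incidents to score")
    return sum(is_success(i) for i in incidents) / len(incidents)


if __name__ == "__main__":
    # Made-up sample data; NOT taken from the Cisco study.
    sample = [
        Incident({"e1", "e2", "e7"}, {"e2"}, True),   # match and useful
        Incident({"e3", "e4"}, {"e5"}, True),         # ML missed the event
        Incident({"e6", "e8"}, {"e6", "e8"}, True),   # match and useful
    ]
    print(f"accuracy: {accuracy(sample):.1%}")
```

The subset test (`<=`) means extra events in the ML summary do not count against it directly; verbosity is penalized only through the engineer's usefulness judgment, which matches how the criteria are described above.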