Anomaly detection in large server farms and networks
-- important problem. OP -- nice description of the
problem!

But, there's more!

(1) Zero Day Problems.

If we see a problem very often, maybe even just once,
hopefully we make changes to protect against that
problem so that the chances of seeing that problem
again fall to nearly nothing.

So, after we make such changes, net, the problems we
really want to detect are the ones we've never seen
before. That is, we want to detect problems when they
are seen for the first time, on *day zero*, that is,
*zero day* problems.

So, in this situation, we have not yet seen any of
the problems we are trying to detect, are trying to
detect problems we have never seen before, and, thus,
do not have data on those problems.

And, for a problem we've seen before, even if we
don't make changes to protect against that problem, a
good detector for just that problem is usually a
comparatively easy challenge and one that we should
meet.

(2) Good Data.

Call a system from which we are collecting data our
*target* system. So, if our target system is
reasonably *stable*, then maybe we can collect data
from that system for some hours, weeks, maybe months.
If we let the data *age* and still see no symptoms of
problems, then we regard this data, this *history*
data, as from a *healthy* system, that is, *good*
data.

(3) Hypothesis Test.

So, continually in near real time, we collect data
and do some calculation to raise an alarm or not.

This work needs to be essentially a statistical
*hypothesis test* performed continually as we receive
data.

In such a test, we tentatively assume that the target
system is healthy. This assumption is our *null
hypothesis*, that is, an assumption of a healthy
system, nothing wrong, a *null* bad effect.

Then we use the assumption of this null hypothesis,
the *good* data, and the data we just received
(observed) to calculate a number we call a *test
statistic*, and then calculate the probability,
usually called alpha, of getting a test statistic
that far from what it would have been for a healthy
system. If that probability is too low to be
reasonable, then we reject the null hypothesis that
the system was healthy, conclude that the system is
sick, and raise an alarm.

So, alpha is the probability of getting such a bad
value for the test statistic when the system is
healthy. So, for a healthy system, alpha is the
probability of raising a false alarm. Alpha is also
commonly called the Type I error rate.

Then a missed detection of a real problem is a Type
II error, and commonly its probability is called
beta. That is, beta is the probability of saying that
the system is healthy when it is sick.

The *detection rate*, the probability of saying the
system is sick when it is, is one minus beta.

Commonly we call alpha the *rate* of false alarms and
beta the *rate* of missed detections.
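
To make that concrete, here is a minimal sketch of
one such test in Python; the Gaussian form of the
test statistic and all the numbers are illustrative
assumptions, not a recommendation (see (7) below):

    # Minimal sketch of the continual hypothesis test.
    # Assumes, for illustration only, a Gaussian-style
    # z test statistic; real farm data is rarely
    # Gaussian (see (7)).
    import statistics

    def calibrate(good_data):
        """Summarize the good (healthy) history data."""
        return statistics.mean(good_data), statistics.stdev(good_data)

    def alarm(observation, mean, stdev, threshold):
        """True means reject the null hypothesis that
        the target system is healthy."""
        z = abs(observation - mean) / stdev  # the test statistic
        return z > threshold                 # too far from healthy

    # Threshold 2.58 gives alpha ~ 0.01 *if* the healthy
    # data really were Gaussian -- an assumption, not a fact.
    mean, stdev = calibrate([10.1, 9.8, 10.3, 10.0, 9.9, 10.2])
    print(alarm(14.7, mean, stdev, threshold=2.58))  # True: alarm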

(4) Detector Quality.

There can be many hypothesis tests. A *perfect* test
is one with both alpha and beta zero; usually on the
shelf of reality, the box of the perfect tests is
empty.

It's easy to have alpha, the rate of false alarms, be
zero -- just turn off the detector. But then beta,
the rate of missed detections, will be 1.

It's easy to have beta be zero -- just sound the
alarm all the time. But then alpha will be 1.

Generally, for a given detector that is not perfect,
there is a trade-off -- the lower we insist that
alpha be, the higher beta will be.

But not all detectors are the same: Some detectors
are better than others, that is, *closer* to being a
perfect detector, that is, with a given alpha give a
smaller beta, that is, a better trade-off. And
detectors, even ones with the same alpha and beta,
can differ on what real problems they detect.
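
One quick way to see the trade-off: sweep the alarm
threshold of a simple detector and estimate alpha and
beta empirically. The two Gaussian populations below
are made-up stand-ins for healthy and sick data:

    # Sketch: as the threshold rises, alpha (false alarms)
    # falls and beta (missed detections) rises.
    import random

    random.seed(0)
    healthy = [random.gauss(0, 1) for _ in range(100_000)]
    sick = [random.gauss(3, 1) for _ in range(100_000)]  # assumed fault

    for threshold in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
        alpha = sum(x > threshold for x in healthy) / len(healthy)
        beta = sum(x <= threshold for x in sick) / len(sick)
        print(f"threshold {threshold:.1f}: alpha {alpha:.4f}, beta {beta:.4f}")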

(5) Best Detector.

The question of what would be the best possible
detector was answered by the Neyman-Pearson result.
So, for a given alpha, the best possible detector
gets the lowest possible beta. A relatively general
proof can be obtained from measure theory and, there,
the Hahn decomposition and the Radon-Nikodym theorem.

Alas, usually in practice, the Neyman-Pearson result
asks for more data than we can have; in particular,
when looking for zero-day problems we can't hope to
use Neyman-Pearson.

A high *quality* detector is one with a relatively
low beta for its alpha. In practice a high quality
detector saves money, both from chasing false alarms
and from the possibly serious problems of missed
detections. Of course, the Neyman-Pearson result
tells us how to create the highest quality detector
possible.
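
For completeness, a sketch of what the Neyman-Pearson
result prescribes when we do know the healthy and the
sick distributions; the two Gaussians here are pure
illustration, and knowing the sick distribution is
exactly what zero day problems deny us:

    # Neyman-Pearson detector: alarm when the likelihood
    # ratio sick/healthy exceeds a threshold chosen for
    # the desired alpha. Both densities assumed known.
    import math

    def density(x, mean, stdev):
        """Gaussian density, an illustrative assumption."""
        return (math.exp(-((x - mean) / stdev) ** 2 / 2)
                / (stdev * math.sqrt(2 * math.pi)))

    def alarm(x, ratio_threshold):
        healthy = density(x, mean=0.0, stdev=1.0)  # assumed healthy model
        sick = density(x, mean=3.0, stdev=1.0)     # assumed fault model
        return sick / healthy > ratio_threshold    # likelihood ratio test

    print(alarm(0.2, ratio_threshold=1.0))  # False: looks healthy
    print(alarm(2.9, ratio_threshold=1.0))  # True: alarm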

(6) Adjusting Alpha.

Commonly in practice, we can select a value for alpha
and have our detector obtain that value. So, in
advance we can select the value we want for alpha and
get that value in practice. But typically we have to
get the corresponding value of beta by empirical
means. Since when looking for zero day problems in a
well run server farm or network we stand to get
relatively few detections, we can have trouble
getting an accurate estimate of beta.
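
A sketch of selecting alpha in advance and then
checking empirically that we obtained it, again under
an assumed Gaussian healthy model purely for
illustration:

    # Select alpha = 0.01, set the matching threshold,
    # then verify the achieved false alarm rate on
    # simulated healthy data. Numbers are illustrative.
    import random

    random.seed(1)
    THRESHOLD = 2.326  # one-sided Gaussian point for alpha = 0.01

    healthy = (random.gauss(0, 1) for _ in range(1_000_000))
    false_alarms = sum(x > THRESHOLD for x in healthy)
    print(false_alarms / 1_000_000)  # ~0.01, the alpha we selected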

(7) Data Distribution.

It can help us create a higher quality detector if we
know the probability distribution of the data we
observe when the system is healthy. As in the OP,
maybe that distribution could be Gaussian, although
with much time with real data from networks and
server farms we expect to see Gaussian data only
rarely.

If we make no assumptions about the probability
distribution of the data from a healthy target, then
our statistical hypothesis test is
*distribution-free*. For data from real server farms
and networks, we are usually forced to use
distribution-free tests. A special case of
*distribution-free* is *non-parametric*.
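
A sketch of one classic distribution-free recipe: set
the threshold at the m-th largest of n good
observations; then, for a healthy target, the false
alarm rate is m/(n + 1) from exchangeability alone,
no distribution assumed. The exponential data below
is just a stand-in:

    # Distribution-free test: with n good observations,
    # a new healthy observation exceeds the m-th largest
    # of them with probability m/(n + 1).
    import random

    random.seed(2)
    n, m = 999, 10                     # alpha = m/(n + 1) = 0.01
    good = sorted(random.expovariate(1.0) for _ in range(n))
    threshold = good[-m]               # m-th largest good value

    def alarm(observation):
        return observation > threshold

    print(alarm(good[0]), alarm(2 * max(good)))  # False True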

(8) Dimensions.

It is common in practice, from one target system, to
be able to collect data on each of several, say, n,
variables at data rates from a point every few
seconds up to some hundreds of points a second.

Typically the data from one variable is not
independent of that from the other variables.

So, with several variables, there is a
multi-dimensional, n-dimensional, region, the
*critical* region, such that we raise an alarm if and
only if we get data in that region.

For a high quality detector, that region should
accurately fit where we want to raise an alarm -- the
Neyman-Pearson result, when we have data enough to
use it, can specify just what that region is.

If our detector is based on just *thresholds* on the
separate variables, then we are forced to have our
critical region be just some n-dimensional box, and
such a box gives us relatively little ability to get
an accurate *fit*. With a poor fit, for our selected
alpha, we stand to get a relatively high beta and,
thus, lower detector quality.

Of course, for our n variables, the best detector
that does good work with all n variables jointly will
be the best detector we can have. After all, whatever
can be done with the variables separately can also be
done, along with more, in an n-dimensional detector.

For an intuitive explanation, suppose n = 2 and we
get data points on a checker board. Suppose a point
on a red square indicates a healthy target and a
point on a black square, a sick one. If we consider
the n = 2 variables separately, then we will have a
low quality detector, but if we consider the n = 2
variables together, then we can have a perfect
detector.
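
A sketch of that checker board intuition, with
made-up data:

    # Checker board: black squares are sick. Each variable
    # alone is useless; jointly the two are perfect.
    import math
    import random

    def black(x, y):
        """Color of the unit square holding (x, y)."""
        return (math.floor(x) + math.floor(y)) % 2 == 1

    random.seed(3)
    points = [(random.uniform(0, 8), random.uniform(0, 8))
              for _ in range(10_000)]
    healthy = [p for p in points if not black(*p)]
    sick = [p for p in points if black(*p)]

    # One variable alone: x has the same distribution on red
    # and black squares, so a threshold on x is useless.
    alpha = sum(x > 4.0 for x, y in healthy) / len(healthy)
    beta = sum(x <= 4.0 for x, y in sick) / len(sick)
    print(alpha, beta)  # each ~0.5, no better than guessing

    # Both variables jointly: take the critical region to be
    # exactly the black squares -- perfect by construction.
    def joint_alarm(x, y):
        return (math.floor(x) + math.floor(y)) % 2 == 1

    print(all(joint_alarm(x, y) == black(x, y)
              for x, y in points))  # True: alpha = beta = 0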

With n-dimensional data, usually we have to give up
on knowing the probability distribution of the data.

(9) Old Techniques.

For a long time the workhorse of server monitoring
was thresholds. Later *expert systems* tried to use
*rules* to determine when to raise an alarm, say,

    When I see A, B, and one of
    X, Y, or Z, it looks bad;
    raise an alarm.
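
In code, such a rule is just fixed boolean logic, say
(a hypothetical translation):

    # Hypothetical translation of such a rule: pure boolean
    # logic, with no probability, and no alpha, in sight.
    def raise_alarm(a, b, x, y, z):
        return a and b and (x or y or z)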

Here we had no idea of detector quality or false
alarm rate and no ability to adjust the false alarm
rate. In effect, the work was necessarily statistical
hypothesis testing, except it was being done poorly.

(10) Summary.

The OP was correct that false alarms are bad but
missed detections can be worse. So, we want high
quality detectors, and, for a given detector, to get
the lowest rate of missed detections, that is, the
highest detection rate we can, we set the false alarm
rate at the highest value the operating staff is
willing to tolerate. Of course, in practice, if the
false alarm rate is too high, the staff may just
ignore the detector and its alarms, thus giving a
zero detection rate.

So, what we want is a collection of statistical
hypothesis tests that are both n-dimensional and
distribution-free, where we can select and know the
false alarm rates and otherwise have good evidence of
high quality detectors.

All of this discussion is now quite old material. My
conclusion for some years has been that people with
large server farms and networks doing important work
really should be interested, but nearly no one is.

Apparently the OP has re-discovered this subject.
Here I've tried to get everyone caught up as of some
years ago!