Anomaly detection in large server farms and networks
-- important problem. OP -- nice description of the
problem!

But, there's more!

(1) Zero Day Problems.

If we see a problem very often, maybe even just once,
hopefully we make changes to protect against that
problem so that the chances of seeing that problem
again fall to nearly nothing.

So, after we make such changes, net, the problems we
really want to detect are the ones we've never seen
before. That is, we want to detect problems when they
are seen for the first time, on *day zero*, that is,
*zero day* problems.

So, in this situation, we have not yet seen any of
the problems we are trying to detect, are trying to
detect problems we have never seen before, and, thus,
do not have data on those problems.

And, for a problem we've seen before, even if we
don't make changes to protect against that problem, a
good detector for just that problem is usually a
comparatively easy challenge and one that we should
meet.

(2) Good Data.

Call a system from which we are collecting data our
*target* system. So, if our target system is
reasonably *stable*, then maybe we can collect data
from that system for some hours, weeks, maybe months.
If we let the data *age* and still see no symptoms of
problems, then we regard this data, this *history*
data, as from a *healthy* system, that is, *good*
data.

(3) Hypothesis Test.

So, continually in near real time, we collect data
and do some calculation to raise an alarm or not.

This work needs to be essentially a statistical
*hypothesis test* performed continually as we receive
data.

In such a test, we tentatively assume that the target
system is healthy. This assumption is our *null
hypothesis*, that is, an assumption of a healthy
system, nothing wrong, a *null* bad effect.

Then we use the assumption of this null hypothesis,
the *good* data, and the data we just received
(observed) to calculate a number we call a *test
statistic*, and then calculate the probability,
usually called alpha, of getting a test statistic
that far from what it would have been for a healthy
system. If that probability is too low to be
reasonable, then we reject the null hypothesis that
the system was healthy, conclude that the system is
sick, and raise an alarm.

So, alpha is the probability of getting such a bad
value for the test statistic when the system is
healthy. So, for a healthy system, alpha is the
probability of raising a false alarm. Alpha is also
commonly called the Type I error rate.

Then a missed detection of a real problem is a Type
II error, and commonly its probability is called
beta. That is, beta is the probability of saying that
the system is healthy when it is sick.

The *detection rate*, the probability of saying the
system is sick when it is, is one minus beta.

Commonly we call alpha the *rate* of false alarms and
beta the *rate* of missed detections.
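
To make that concrete, here is a minimal sketch of
one such test in Python; the Gaussian form of the
test statistic and all the numbers are illustrative
assumptions, not a recommendation (see (7) below):

    # Minimal sketch of the continual hypothesis test.
    # Assumes, for illustration only, a Gaussian-style
    # z test statistic; real farm data is rarely
    # Gaussian (see (7)).
    import statistics

    def calibrate(good_data):
        """Summarize the good (healthy) history data."""
        return statistics.mean(good_data), statistics.stdev(good_data)

    def alarm(observation, mean, stdev, threshold):
        """True means reject the null hypothesis that
        the target system is healthy."""
        z = abs(observation - mean) / stdev  # the test statistic
        return z > threshold                 # too far from healthy

    # Threshold 2.58 gives alpha ~ 0.01 *if* the healthy
    # data really were Gaussian -- an assumption, not a fact.
    mean, stdev = calibrate([10.1, 9.8, 10.3, 10.0, 9.9, 10.2])
    print(alarm(14.7, mean, stdev, threshold=2.58))  # True: alarm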

(4) Detector Quality.

There can be many hypothesis tests. A *perfect* test
is one with both alpha and beta zero; usually on the
shelf of reality, the box of the perfect tests is
empty.

It's easy to have alpha, the rate of false alarms, be
zero -- just turn off the detector. But then beta,
the rate of missed detections, will be 1.

It's easy to have beta be zero -- just sound the
alarm all the time. But then alpha will be 1.

Generally, for a given detector that is not perfect,
there is a trade-off -- the lower we insist that
alpha be, the higher beta will be.

But not all detectors are the same: Some detectors
are better than others, that is, *closer* to being a
perfect detector, that is, with a given alpha give a
smaller beta, that is, a better trade-off. And
detectors, even ones with the same alpha and beta,
can differ on what real problems they detect.
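
One quick way to see the trade-off: sweep the alarm
threshold of a simple detector and estimate alpha and
beta empirically. The two Gaussian populations below
are made-up stand-ins for healthy and sick data:

    # Sketch: as the threshold rises, alpha (false alarms)
    # falls and beta (missed detections) rises.
    import random

    random.seed(0)
    healthy = [random.gauss(0, 1) for _ in range(100_000)]
    sick = [random.gauss(3, 1) for _ in range(100_000)]  # assumed fault

    for threshold in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
        alpha = sum(x > threshold for x in healthy) / len(healthy)
        beta = sum(x <= threshold for x in sick) / len(sick)
        print(f"threshold {threshold:.1f}: alpha {alpha:.4f}, beta {beta:.4f}")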

(5) Best Detector.

The question of what would be the best possible
detector was answered by the Neyman-Pearson result.
So, for a given alpha, the best possible detector
gets the lowest possible beta. A relatively general
proof can be obtained from measure theory and, there,
the Hahn decomposition and the Radon-Nikodym theorem.

Alas, usually in practice, the Neyman-Pearson result
asks for more data than we can have; in particular,
when looking for zero-day problems we can't hope to
use Neyman-Pearson.

A high *quality* detector is one with a relatively
low beta for its alpha. In practice a high quality
detector saves money, both from chasing false alarms
and from the possibly serious problems of missed
detections. Of course, the Neyman-Pearson result
tells us how to create the highest quality detector
possible.
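
For completeness, a sketch of what the Neyman-Pearson
result prescribes when we do know the healthy and the
sick distributions; the two Gaussians here are pure
illustration, and knowing the sick distribution is
exactly what zero day problems deny us:

    # Neyman-Pearson detector: alarm when the likelihood
    # ratio sick/healthy exceeds a threshold chosen for
    # the desired alpha. Both densities assumed known.
    import math

    def density(x, mean, stdev):
        """Gaussian density, an illustrative assumption."""
        return (math.exp(-((x - mean) / stdev) ** 2 / 2)
                / (stdev * math.sqrt(2 * math.pi)))

    def alarm(x, ratio_threshold):
        healthy = density(x, mean=0.0, stdev=1.0)  # assumed healthy model
        sick = density(x, mean=3.0, stdev=1.0)     # assumed fault model
        return sick / healthy > ratio_threshold    # likelihood ratio test

    print(alarm(0.2, ratio_threshold=1.0))  # False: looks healthy
    print(alarm(2.9, ratio_threshold=1.0))  # True: alarm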

(6) Adjusting Alpha.

Commonly in practice, we can select a value for alpha
and have our detector obtain that value. So, in
advance we can select the value we want for alpha and
get that value in practice. But typically we have to
get the corresponding value of beta by empirical
means. Since when looking for zero day problems in a
well run server farm or network we stand to get
relatively few detections, we can have trouble
getting an accurate estimate of beta.
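
A sketch of selecting alpha in advance and then
checking empirically that we obtained it, again under
an assumed Gaussian healthy model purely for
illustration:

    # Select alpha = 0.01, set the matching threshold,
    # then verify the achieved false alarm rate on
    # simulated healthy data. Numbers are illustrative.
    import random

    random.seed(1)
    THRESHOLD = 2.326  # one-sided Gaussian point for alpha = 0.01

    healthy = (random.gauss(0, 1) for _ in range(1_000_000))
    false_alarms = sum(x > THRESHOLD for x in healthy)
    print(false_alarms / 1_000_000)  # ~0.01, the alpha we selected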

(7) Data Distribution.

It can help us create a higher quality detector if we
know the probability distribution of the data we
observe when the system is healthy. As in the OP,
maybe that distribution could be Gaussian, although
with much time with real data from networks and
server farms we expect to see Gaussian data only
rarely.

If we make no assumptions about the probability
distribution of the data from a healthy target, then
our statistical hypothesis test is
*distribution-free*. For data from real server farms
and networks, we are usually forced to use
distribution-free tests. A special case of
*distribution-free* is *non-parametric*.
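
A sketch of one classic distribution-free recipe: set
the threshold at the m-th largest of n good
observations; then, for a healthy target, the false
alarm rate is m/(n + 1) from exchangeability alone,
no distribution assumed. The exponential data below
is just a stand-in:

    # Distribution-free test: with n good observations,
    # a new healthy observation exceeds the m-th largest
    # of them with probability m/(n + 1).
    import random

    random.seed(2)
    n, m = 999, 10                     # alpha = m/(n + 1) = 0.01
    good = sorted(random.expovariate(1.0) for _ in range(n))
    threshold = good[-m]               # m-th largest good value

    def alarm(observation):
        return observation > threshold

    print(alarm(good[0]), alarm(2 * max(good)))  # False True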

(8) Dimensions.

It is common in practice, from one target system, to
be able to collect data on each of several, say, n,
variables at data rates from a point every few
seconds up to some hundreds of points a second.

Typically the data from one variable is not
independent of that from the other variables.

So, with several variables, there is a
multi-dimensional, n-dimensional, region, the
*critical* region, such that we raise an alarm if and
only if we get data in that region.

For a high quality detector, that region should
accurately fit where we want to raise an alarm -- the
Neyman-Pearson result, when we have data enough to
use it, can specify just what that region is.

If our detector is based on just *thresholds* on the
separate variables, then we are forced to have our
critical region be just some n-dimensional box, and
such a box gives us relatively little ability to get
an accurate *fit*. With a poor fit, for our selected
alpha, we stand to get a relatively high beta and,
thus, lower detector quality.

Of course, for our n variables, the best detector
that does good work with all n variables jointly will
be the best detector we can have. After all, whatever
can be done with the variables separately can also be
done, along with more, in an n-dimensional detector.

For an intuitive explanation, suppose n = 2 and we
get data points on a checker board. Suppose a point
on a red square indicates a healthy target and a
point on a black square, a sick one. If we consider
the n = 2 variables separately, then we will have a
low quality detector, but if we consider the n = 2
variables together, then we can have a perfect
detector.
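
A sketch of that checker board intuition, with
made-up data:

    # Checker board: black squares are sick. Each variable
    # alone is useless; jointly the two are perfect.
    import math
    import random

    def black(x, y):
        """Color of the unit square holding (x, y)."""
        return (math.floor(x) + math.floor(y)) % 2 == 1

    random.seed(3)
    points = [(random.uniform(0, 8), random.uniform(0, 8))
              for _ in range(10_000)]
    healthy = [p for p in points if not black(*p)]
    sick = [p for p in points if black(*p)]

    # One variable alone: x has the same distribution on red
    # and black squares, so a threshold on x is useless.
    alpha = sum(x > 4.0 for x, y in healthy) / len(healthy)
    beta = sum(x <= 4.0 for x, y in sick) / len(sick)
    print(alpha, beta)  # each ~0.5, no better than guessing

    # Both variables jointly: take the critical region to be
    # exactly the black squares -- perfect by construction.
    def joint_alarm(x, y):
        return (math.floor(x) + math.floor(y)) % 2 == 1

    print(all(joint_alarm(x, y) == black(x, y)
              for x, y in points))  # True: alpha = beta = 0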

With n-dimensional data, usually we have to give up
on knowing the probability distribution of the data.

(9) Old Techniques.

For a long time the workhorse of server monitoring
was thresholds. Later *expert systems* tried to use
*rules* to determine when to raise an alarm, say,

    When I see A, B, and one of
    X, Y, or Z, it looks bad;
    raise an alarm.
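
In code, such a rule is just fixed boolean logic, say
(a hypothetical translation):

    # Hypothetical translation of such a rule: pure boolean
    # logic, with no probability, and no alpha, in sight.
    def raise_alarm(a, b, x, y, z):
        return a and b and (x or y or z)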

Here we had no idea of detector quality or false
alarm rate and no ability to adjust the false alarm
rate. In effect, the work was necessarily statistical
hypothesis testing, except it was being done poorly.

(10) Summary.

The OP was correct that false alarms are bad but
missed detections can be worse. So, we want high
quality detectors, and, for a given detector, to get
the lowest rate of missed detections, that is, the
highest detection rate we can, we set the false alarm
rate at the highest value the operating staff is
willing to tolerate. Of course, in practice, if the
false alarm rate is too high, the staff may just
ignore the detector and its alarms, thus giving a
zero detection rate.

So, what we want is a collection of statistical
hypothesis tests that are both n-dimensional and
distribution-free, where we can select and know the
false alarm rates and otherwise have good evidence of
high quality detectors.

All of this discussion is now quite old material. My
conclusion for some years has been that people with
large server farms and networks doing important work
really should be interested, but nearly no one is.

Apparently the OP has re-discovered this subject.
Here I've tried to get everyone caught up as of some
years ago!