In the OP, the passage<p>> The "bug" here was not a software bug, or even a bad configuration: it was the unexpected interaction between two very different (and independently-maintained) software systems leading to a new mode of resource exhaustion.<p>can be viewed as pointing to the solution:<p>Broadly in computing, we have 'resource limits',
from the largest integer that can be stored
in 2's complement in 32 bits to disk partition
size, maximum single file size, number of IP
ports, etc. Mostly we don't pay really careful
attention to all of these resource limits.<p>Well, for 'systems' such as are being considered
in the OP, do consider the resource limits. So,
for each such limit, have some software track
its usage and then report the usage and especially
when usage of that resource is close to being exhausted.
And when the resource is exhausted, then report that
the problem was that the resource was exhausted and
who the users of that resource were.
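<p>A minimal sketch of that idea in Python; the resource names, limits, warning threshold, and the current_usage() probe are all made up for illustration:

    # Sketch: track usage of each limited resource, report when usage is close
    # to the limit, and report exhaustion plus its users when the limit is hit.
    LIMITS = {
        "open_file_descriptors": 1_048_576,
        "ephemeral_ports": 28_232,
        "disk_bytes": 500 * 10**9,
    }
    WARN_FRACTION = 0.9  # report at 90% of a limit

    def current_usage(resource):
        # Hypothetical probe; in practice read /proc, statvfs(), netstat, etc.
        raise NotImplementedError

    def check_limits():
        for resource, limit in LIMITS.items():
            used = current_usage(resource)
            if used >= limit:
                print(f"EXHAUSTED: {resource} ({used}/{limit}); report its users")
            elif used >= WARN_FRACTION * limit:
                print(f"WARNING: {resource} at {used / limit:.0%} of its limit")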
<p>For a little more, we can make some progress by
doing simulations of 'networks of queues' with
whatever stochastic processes we want to assume
for 'work arrival times' and
'work service times'.
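<p>As a toy version of that kind of simulation -- one queue, Poisson arrivals, exponential service times, with the rates here being arbitrary assumptions:

    # Sketch: simulate a single queue with random arrival and service times and
    # measure the average wait, i.e., how the backlog behaves near saturation.
    import random

    def simulate_queue(arrival_rate=0.9, service_rate=1.0, n_jobs=100_000):
        clock = 0.0            # time of the current arrival
        server_free_at = 0.0   # when the server finishes its current job
        total_wait = 0.0
        for _ in range(n_jobs):
            clock += random.expovariate(arrival_rate)   # 'work arrival time'
            start = max(clock, server_free_at)          # wait if the server is busy
            server_free_at = start + random.expovariate(service_rate)  # 'work service time'
            total_wait += start - clock
        return total_wait / n_jobs

    print("mean wait in queue:", simulate_queue())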
<p>For more, do some real-time system monitoring.
For problems that are well understood and/or
have been seen before, e.g., running out of
a critical resource, just monitor for the
well understood aspects of the problem.<p>Then
there are problems never seen before and
not yet well understood. Here monitoring is
essentially forced to be a continually applied
'statistical hypothesis test' for the 'health
and wellness' of the system complete with
rates of 'false positives' (false alarms) and
rates of 'false negatives' (missed detections
of real problems).<p>For the operational definition of 'healthy and
well', collect some operational data, let it
'age' until other evidence indicates that
the system was 'healthy and well' during
that time, and then use that data as the
definition. Then decree and declare that any
data 'too far' from the healthy and well data
indicates 'sick'.<p>Then, as usual in statistical hypothesis testing,
we want to know the false alarm rate in advance
and want to be able to adjust it. And if we
get a 'detection', that is, reject the 'null
hypothesis' that the system is healthy and
well, then we want to know the lowest false alarm
rate at which the particular data observed
in real time would still be a 'detection'
and use that false alarm rate as a measure
of the 'seriousness' of the detection.
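<p>A sketch of that logic for a single monitored score, using only ranks against the archived 'healthy and well' data; how the raw data gets turned into a score is left open:

    # Sketch: the false alarm rate is set in advance via ranks, with no
    # distribution assumed; the smallest rate at which the observation would
    # still be a detection (an empirical p-value) serves as its 'seriousness'.
    def empirical_p_value(healthy_scores, new_score):
        n = len(healthy_scores)
        at_least_as_extreme = sum(1 for s in healthy_scores if s >= new_score)
        return (1 + at_least_as_extreme) / (n + 1)

    def detect(healthy_scores, new_score, false_alarm_rate=0.01):
        p = empirical_p_value(healthy_scores, new_score)
        return p <= false_alarm_rate, p   # (alarm?, seriousness: smaller is worse)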
<p>If we make some relatively weak
probabilistic assumptions about the data,
then, in calculating the false alarm rate,
we can get some probabilistically 'exchangeable'
data. Then we get to apply some group theory
(from abstract algebra) and borrow a little
from classic ergodic theory to calculate the
false alarm rate. And we can use a classic
result in measure theory by S. Ulam,
that the French probabilist Le Cam
nicely called
'tightness', to
show that the hypothesis test is not
'trivial'. See Billingsley, 'Convergence
of Probability Measures'.
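<p>Exchangeability is what lets a plain permutation argument deliver the false alarm rate; a sketch, with the window statistic an arbitrary choice:

    # Sketch: under the null, the recent window is exchangeable with the archived
    # healthy data, so the rank of its statistic among random re-splits of the
    # pooled data gives a false alarm rate with no distribution assumed.
    import random

    def permutation_p_value(healthy, recent, n_resamples=999):
        stat = lambda xs: sum(xs) / len(xs)   # window mean; any statistic works
        observed = stat(recent)
        pooled = list(healthy) + list(recent)
        k = len(recent)
        hits = 0
        for _ in range(n_resamples):
            random.shuffle(pooled)
            if stat(pooled[:k]) >= observed:
                hits += 1
        return (1 + hits) / (1 + n_resamples)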
<p>Of course, the 'most powerful' test for
each given false alarm rate is from the
Neyman-Pearson lemma (the elementary
proofs are not much fun, but there is
a nice proof starting with the
Hahn decomposition, a fast result
from the Radon-Nikodym theorem),
but for problems
never seen before we do not have data
enough to use that result.
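<p>For reference, the lemma's test has the familiar likelihood-ratio form, written here in LaTeX; the catch is that it needs the 'sick' density f_1, which is just what we lack for problems never seen before:

    % Neyman-Pearson: with density f_0 for 'healthy' and f_1 for 'sick', the
    % most powerful level-alpha test rejects exactly when the likelihood ratio is large:
    \[
      \text{reject } H_0 \quad\text{when}\quad \frac{f_1(x)}{f_0(x)} > c,
      \qquad\text{where } c \text{ satisfies }
      P_{H_0}\!\left( \frac{f_1(x)}{f_0(x)} > c \right) = \alpha .
    \]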
<p>For the statistical hypothesis test,
we need two special properties: (1)
Especially for monitoring systems,
in particular complex or distributed systems,
we should have statistical hypothesis
tests that are multi-dimensional (multi-variate);
that is, if we are monitoring once a second,
then each second we should get data on each
of n variables for some positive
integer n. So, we are working with n variables.
(2) As U. Grenander at Brown U. observed
long ago, operational data from computer
systems is probabilistically, certainly
in distribution, quite different from data from most of
the history of statistics. He was correct.
Especially since we are interested in data
on several variables, we have no real hope
of knowing the distribution of the data even
when the system is 'healthy and well' and
still less hope when it is sick with a problem
we have never seen before. So, we need
hypothesis tests that are 'distribution-free',
that is, make no assumptions about probability
distributions. So, we are faced with calculating
false alarm rates for multi-dimensional data
where we know nothing about the probability
distribution.<p>There is a long history of hypothesis tests for
data from a
single variable, including many tests that are
distribution-free. See old books by
S. Siegel,
'Nonparametric Statistics for
the Behavioral Sciences' or
E. Lehmann, 'Nonparametrics'.<p>For multi-dimensional
hypothesis tests, there's not much
and still less when also distribution-free.<p>Why multi-dimensional? Because for the
two interacting systems in the example in
the OP, we guess that to detect the problem
we would want data from both systems. More
generally in a large server farm or network,
we are awash in variables on which we can
collect data at rates from thousands of
points per second down to a point every
few seconds. So, for n dimensions, n can
easily be dozens, hundreds, ....<p>For such tests, look at "anomaly
detector" in
'Information Sciences' in, as I
recall, 1999.<p>If we want to implement such a test, we might
want to read about k-D trees in, say,
Sedgewick. Then think about
backtracking in the tree, depth-first,
and accumulating some cutting planes.
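<p>A bare-bones version of the tree part, just to fix ideas; the cutting-plane bookkeeping the test itself needs is only hinted at in the comments:

    # Sketch: build a k-D tree over the n-dimensional 'healthy' observations and
    # walk it depth-first; a fuller version would accumulate the splitting
    # (cutting) planes along each path while backtracking.
    def build_kdtree(points, depth=0):
        if not points:
            return None
        axis = depth % len(points[0])          # cycle through the n coordinates
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return {
            "point": points[mid],
            "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1),
        }

    def depth_first(node, visit):
        if node is None:
            return
        visit(node["point"], node["axis"])     # e.g., record the cutting plane here
        depth_first(node["left"], visit)
        depth_first(node["right"], visit)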
<p>For monitoring, there was for some time work
by Profs. Patterson and Fox at Stanford
and Berkeley, in their RAD Lab, funded
by Google, Microsoft, and Sun. The
paper in 'Information Sciences' seems
to be ahead.<p>More can be done. E.g., consider the
Rocky Mountains and assume that they
are porous to water. Let the mountains
be the probability density of
two-variable data when the system is
healthy and well. Now pour in water
until, say, the lowest 1% of the probability
mass has been covered. Now the test is: if we
observe a point in the water, raise
an alarm. Now the false alarm rate is
1%. And the region of false alarms has the
largest area we can give it at a 1% rate
(a rough surrogate for the best detection rate --
with a fast use of Fubini's theorem in
measure theory, we can say more here).
As we know, lakes can have fractal boundaries,
and the techniques in the paper will,
indeed, approximate those. Also the techniques
do not require that all the lakes be
connected. And for the dry land, it need
not all be connected either and might
be in islands. So, it is quite general.
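<p>The water-level picture is easy to mock up with a coarse two-variable histogram standing in for the mountains; the bin count and the 1% rate below are illustrative:

    # Sketch of the 'water level': estimate the two-variable 'healthy' density
    # with a histogram, flood the lowest-count bins until about 1% of the
    # probability mass is under water, and alarm on any point that lands in a
    # flooded bin or in a bin never seen in the healthy data.
    from collections import Counter

    def water_level_detector(points, bins=20, false_alarm_rate=0.01):
        lo = [min(c) for c in zip(*points)]
        hi = [max(c) for c in zip(*points)]
        def cell(p):
            return tuple(
                min(bins - 1, int((v - l) / (h - l + 1e-12) * bins))
                for v, l, h in zip(p, lo, hi)
            )
        counts = Counter(cell(p) for p in points)
        flooded, covered = set(), 0
        for c, n in sorted(counts.items(), key=lambda kv: kv[1]):
            if covered + n > false_alarm_rate * len(points):
                break
            flooded.add(c)
            covered += n
        seen = set(counts)
        return lambda p: cell(p) in flooded or cell(p) not in seen  # True = alarm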
<p>But we may want to assume that the region of
healthy and well performance is a convex
set and try again. If this assumption
is correct, then for a given false alarm
rate we will get a higher detection rate,
that is, a more 'powerful' test.
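<p>A sketch of that convex variant in two dimensions, with the convex hull of the healthy data standing in for the healthy region; a fuller version would trim the hull to hit a chosen false alarm rate:

    # Sketch: take the convex hull of the 'healthy' points (Andrew's monotone
    # chain) and alarm on any observation that falls outside it.
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def convex_hull(points):
        pts = sorted(set(points))
        if len(pts) <= 2:
            return pts
        lower, upper = [], []
        for p in pts:
            while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
                lower.pop()
            lower.append(p)
        for p in reversed(pts):
            while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
                upper.pop()
            upper.append(p)
        return lower[:-1] + upper[:-1]   # hull vertices, counter-clockwise

    def outside_hull(hull, p):
        # Outside if strictly to the right of any counter-clockwise edge.
        return any(
            cross(hull[i], hull[(i + 1) % len(hull)], p) < 0
            for i in range(len(hull))
        )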
<p>Still more can be done. But, really,
the work is essentially some applied
math with, at times, some moderately
advanced prerequisites from pure and
applied math. Theme: The crucial
content of the most powerful future
of computing is in math, not computer
science. Sorry 'bout that!