It appears that we have a special,
powerful, valuable opportunity for how to
manipulate the data in the OP.<p>So, the OP has "7 Leading Fraud
Indicators: From Fresh Cookies to Null
Values".<p>Suppose for those 7 indicators, 4 of them
have just two possible values and the
other three have just 4 possible values or
some such. Then for one connection to the
server from a Web browser, the 7 signals
have jointly just<p><pre><code> 2^4 * 4^3 = 1,024
</code></pre>
possible values. That is, there are only
n = 1,024 possible <i>cases</i> of signal data
from a Web browser from a connection to
the server. And apparently we have good
data on each of the cases.<p>Or, to be practical, if for some case we
have no data at all, then we just assume
that the reason is that the probability of
that case is so low that we can ignore
that case.<p>The central problem here is how to detect
"fraudsters". For such detection,
necessarily there are two ways to be
wrong: (1) a false alarm when we say that
a connection is from fraud when it is not
and (2) a missed detection when we say
that a connection is not from fraud when
it is.<p>Our mission, and we have to accept it, is
essentially to find ways of manipulating
the large amount of relevant data so that
(A) from the false alarm (1), we can
specify the highest probability of a false
alarm f we are willing to tolerate, (B)
get that probability of a false alarm f in
practice, and (C) from the missed
detections in (2), for that probability of
a false alarm f, get the lowest
probability of a missed detection (2) we
can.<p>Or, for the false alarms we are willing to
tolerate, we want to manipulate the data
to get all the detections we can.<p>So, for some notation:<p>P -- probability<p>n -- positive integer, number of different
possible cases of data from connections,
e.g., as above, n = 1,024<p>B -- event, connection is bad, fraud<p>G -- event, connection is good, not fraud<p>P(B) + P(G) = P(B OR G) = 1<p>C -- random variable, case of connection,
i = 1, 2, ..., n.<p>So random variable C takes values in the
set {1, 2, ..., n}.<p>p(i) = P(C = i)<p>b(i) = P(B | C = i) = P(B AND C = i)/P(C = i)<p>= P(B AND C = i)/p(i)<p>g(i) = P(G | C = i) = P(G AND C = i)/P(C = i)<p>= P(G AND C = i)/p(i)<p>B = U_i {B AND C = i}<p>P(B) = Sum_i P(B AND C = i)<p>= Sum_i p(i) P(B | C = i)<p>= Sum_i p(i) b(i)<p>P(G) = Sum_i p(i) g(i)<p>b(i) + g(i) = P(B | C = i) + P(G | C = i)<p>= P(B AND C = i)/P(C = i) + P(G AND C = i)/P(C = i)<p>= ( P(B AND C = i) + P(G AND C = i) )/P(C = i)<p>= P( (B AND C = i) OR (G AND C = i) )/P(C = i)<p>= P(C = i)/P(C = i) = 1<p>M -- event, a missed detection of a bad
connection, fraud<p>D -- event, detection of a bad connection, fraud<p>F -- event, false alarm<p><i>Detection Rule:</i><p>Suppose for some set I a subset of {1, 2,
..., n} we raise an alarm of a detection
of a bad connection, that is, fraud, when
C in I.<p>With this detection rule, probability of a
false alarm is<p>P(F) = Sum_{C in i} P(G AND C = i)<p>= Sum_{C in i} P(G | C = i) p(i)<p>= Sum_{C in i} g(i) p(i)<p>the probability of a detection is<p>P(D) = Sum_{i in I} P(B AND C = i)<p>= Sum_{i in I} P(B | C = i) p(i)<p>= Sum_{i in I} b(i) p(i)<p>and the probability of a missed detection
is<p>P(M) = P(B AND C not in I)<p>= Sum_{j not in I} P(B AND C = j)<p>= Sum_{j not in I} P(B | C = j) p(j)<p>= Sum_j P(B | C = j) p(j)<p>- Sum_{i in I} p(B | C = i) p(i)<p>= Sum_j P(B | C = j) p(j) - P(D)<p>= Sum_j P(B AND C = j) - P(D)<p>= P(B) - P(D)<p>So, to minimize the probability of a
missed detection P(M) we want to maximize
the probability of a detection P(D). We
guessed this intuitively.<p>To maximize the probability of a detection
P(D), suppose we have sorted our data on
the n cases so that the ratios b(i)/g(i)
are in descending order, that is, so that<p>b(1)/g(1) >= b(2)/g(2) >= ... >= b(n)/g(n)<p>Suppose we pick k in {1, 2, ..., n} and let I
= {1, 2, ..., k}.<p>Then for our detection rule with this k
and I, the probability of a false alarm is<p>P(F) = Sum_{i in I} g(i) p(i)<p>So, note that here really we are just
summing i = 1, 2, ..., k where, as just
above,<p>b(1)/g(1) >= b(2)/g(2) >= ... >= b(n)/g(n)<p>So, we just sort these ratios and then sum
the products g(i) p(i) on i until we get
our selected probability of false alarms
f.<p>As we will prove below, this is just the
thing we should do.<p>If we pick k too large, then our
probability of false alarms will be larger
than our selected value f. If we pick k
too small, then our probability of
detection will be smaller than we want.<p>Also for our detection rule with this k
and I, the probability of a detection,
what we want to maximize, is<p>P(D) = Sum_{i in I} b(i) p(i)<p>So, suppose we pick k just large enough
that P(F) = f (or close enough for
government work).<p>Claim: With this selection of k and I, we
get, as in (1), the probability of a false
alarm f we selected and, for that
probability of a false alarm f, get the
probability of a detection P(D) the
largest possible and, as in (2) the
probability of a missed detection the
smallest possible.<p>To see this claim, we want to select x_1,
x_2, ..., x_n to solve the <i>operations
research applied mathematics resource
allocation optimization problem</i><p>Problem 1:<p>max z = P(D) = Sum_i x_i b(i) p(i)<p>subject to<p>P(F) = Sum_i x_i g(i) p(i) <= f<p>x_i = 0, 1<p>Yes, from the x_i, I = {i | x_i = 1}.<p>Problem 2:<p>Suppose for some L >= 0, x = (x_i) solves<p>max z = Sum_i x_i b(i) p(i)<p>- L ( Sum_i x_i g(i) p(i) - f )<p>subject to<p>x_i = 0, 1<p>Then since x = (x_i) solves Problem 2, we
have that for any y = (y_i) that satisfies
the constraints of Problem 1, that is<p>Sum_i y_i g(i) p(i) <= f<p>and<p>y_i = 0, 1<p>we have that<p>Sum_i x_i b(i) p(i)<p>= Sum_i x_i b(i) p(i)<p>- L ( Sum_i x_i g(i) p(i) - f )<p>>= Sum_i y_i b(i) p(i)<p>- L ( Sum_i y_i g(i) p(i) - f )<p>>= Sum_i y_i b(i) p(i)<p>so that x = (x_i) solves Problem 1.<p>For more, in Problem 2, we have<p>max z = Sum_i x_i b(i) p(i)<p>- L ( Sum_i x_i g(i) p(i) - f )<p>= Sum_i ( x_i b(i) p(i) - L x_i g(i) p(i) )<p>- L f<p>= Sum_i ( x_i ( b(i) p(i) - L g(i) p(i) ) )<p>- L f<p>so that x_i = 1 if and only if<p>x_i ( b(i) p(i) - L g(i) p(i) ) >= 0<p>and<p>b(i)/g(i) >= L<p>So, the way to solve Problem 1 is to pick
k = 1, 2, ..., n, set I = {1, 2, ..., k},
and set x_i = 1 for i in I and x_i = 0
otherwise so that<p>Sum_{i in I} x_i g(i) p(i) = f<p>In particular,<p>L = b(k)/g(l)<p>That is, intuitively, we are making
investments in real estate, our
probability of a false alarm<p>P(F) = Sum_i x_i g(i) p(i) = f<p>is like money.<p>We get to invest the money in cases i = 1,
2, ..., n. For case i,<p>b(i)/g(i) = (b(i)p(i))/(g(i)p(i))<p>is our return on investment, that is, at
investment i, the probability of detection
we get for the probability of false alarms
we are willing to tolerate.<p>So, we sort so that the ratios<p>b(1)/g(1) >= b(2)/g(2) >= ... >= b(n)/g(n)<p>and make investments in the order i = 1,
2, ... until we have spent all our money.<p>Then, for the money we have spent, that is
the best return on our investment we can
get.<p>Thanks to J. Lagrange, K. Pearson, J.
Neyman, and H. Everett.