Okay, PG has an <i>hypothesis</i> test.<p>There's a large literature for that, e.g.,<p>E. L. Lehmann, <i>Testing Statistical
Hypotheses</i>.<p>E. L. Lehmann, <i>Nonparametrics:
Statistical Methods Based on Ranks</i>.<p>Sidney Siegel, <i>Nonparametric Statistics
for the Behavioral Sciences</i>.<p>In this case, PG will be more interested
in the <i>non-parametric</i> case, i.e.,
<i>distribution-free</i>, where we make no
assumptions about probability
distributions.<p>We start an hypothesis test with an
<i>hypothesis</i>, commonly called the <i>null
hypothesis</i>, which is an assumption that
there is no <i>effect</i> or, in PG's case, <i>no
bias</i>. Then with that assumption, we are
able to do some probability calculations.<p>Then we look at the real data and
calculate the probability of, say, the
evidence of bias being as large as we
observed. If that probability is small,
say, less than 1%, then we <i>reject</i> the
<i>null hypothesis</i>, that is, reject the
assumption of no <i>bias</i>, and conclude that
the null hypothesis is false and that
there is bias. The role of the assumption
about the sample is so that we know that
the <i>problem</i> is bias and not something
about the sample.<p>In hypothesis testing, about all that
matters are just two numbers -- the
probability of Type I error and that of
Type II error. We want both probabilities
to be as low as possible.<p>Type I Error: We reject the null
hypothesis when it is true, e.g., we
conclude bias when there is none.<p>Type II Error: We fail to reject (i.e.,
we accept) the null hypothesis when it is
false.<p>When looking for bias, Type I error can be
called a <i>false alarm</i> of bias, and Type
II error can be called a <i>missed
detection</i> of bias.<p>In PG's case, suppose we have 100 startups
and five of those have women founders.
Suppose for each of the startups we have
the data from "their subsequent
performance is measured".<p>Our null hypothesis is that the expected
performance of the women is the same as
that of the men.<p>So, let's find those two averages and take
the difference, say, the average of the
women less the average of the men.<p>PG says if this difference is positive,
then there was bias, but PG has not given
us any estimate of the probability of Type
I error, that is, of the probability (or
<i>rate</i>) of a false alarm.<p>I mean we don't want to get First Round
Capital in trouble with Betty Friedan,
Gloria Steinem, Marissa Mayer, Sheryl
Sandberg, Hillary Clinton, Ivanka Trump,
or Lady Gaga unjustly! :-).<p>Let's call this difference our <i>test
statistic</i>.<p>So, let's find the probability of a false
alarm:<p>So, let's put all 100 measurements in a pot, stir the pot vigorously (we can use a computer for this), pull out five numbers and average them, pull out the other 95 numbers and average them, take the difference of the two averages, that of the five less that of the 95, and do this, say, 1000 times. Ah, computers are cheap; let's be generous and do this 10,000 times.
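<p>In code, the pot stirring might look like this sketch in Python. The performance numbers and the helper names are mine, made up purely for illustration, and here I just use Python's built-in random number generator (a generator along the lines described below would do as well).

<pre><code>
# A sketch of the pot stirring.  The "performance" numbers are made up.
import random

random.seed(2015)  # so the run repeats exactly

# Hypothetical "subsequent performance" measurements.
women = [random.gauss(1.10, 0.30) for _ in range(5)]   # the 5 women-founded startups
men   = [random.gauss(1.00, 0.30) for _ in range(95)]  # the other 95

def mean(xs):
    return sum(xs) / len(xs)

# Our test statistic: the average of the women less the average of the men.
observed = mean(women) - mean(men)

# Put all 100 measurements in a pot, stir, pull out 5 and 95,
# take the difference of the two averages, and do this 10,000 times.
pot = women + men
differences = []
for _ in range(10_000):
    random.shuffle(pot)                          # stir the pot vigorously
    differences.append(mean(pot[:5]) - mean(pot[5:]))

print("observed difference:", observed)
print("stirred differences computed:", len(differences))
</code></pre>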
<p>For a random number, how about starting with a 32 bit integer and, with appropriately long precision arithmetic, multiplying by 5^15, adding 1, taking the result modulo 2^47, and scaling as we want?
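<p>A minimal sketch of that recipe; the class name, the seed, and the scaling to [0, 1) are my own choices, and Python's integers already give us the long precision. For real work a library generator is fine too.

<pre><code>
# The congruential recipe just described: multiply by 5^15, add 1,
# take the result modulo 2^47, and scale.
class PotStirrer:
    def __init__(self, seed):
        self.x = seed & 0xFFFFFFFF     # start with a 32 bit integer

    def next_uniform(self):
        self.x = (self.x * 5**15 + 1) % 2**47
        return self.x / 2**47          # scale as we want, here to [0, 1)

rng = PotStirrer(123456789)
print([round(rng.next_uniform(), 4) for _ in range(5)])
</code></pre>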
<p>So, we get an empirical distribution of these differences, from the five less the 95. Looking at that distribution, we see what the probability is of getting a difference as high as or higher than our test statistic. If that probability is low, say, 1% or less, then we reject the null hypothesis of no bias and conclude bias, with our estimate of the probability of Type I error 1% or less.
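<p>In code, that last step is just counting; a sketch, with a tiny made-up list of differences standing in for the 10,000 from the pot stirring above.

<pre><code>
# Estimate the probability of a false alarm from the empirical
# distribution of stirred differences.
def permutation_p_value(differences, observed):
    """Fraction of stirred differences as high as or higher than observed."""
    return sum(1 for d in differences if d >= observed) / len(differences)

# Made-up numbers, just to show the mechanics; in practice use the
# 10,000 differences and the observed statistic from the sketch above.
example_differences = [-0.20, -0.10, 0.00, 0.02, 0.05, 0.10, 0.15, 0.30, 0.40, -0.05]
example_observed = 0.35

p = permutation_p_value(example_differences, example_observed)
print("estimated Type I error probability if we reject:", p)
if p <= 0.01:
    print("reject the null hypothesis: conclude bias")
else:
    print("do not reject: no evidence of bias at the 1% level")
</code></pre>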
<p>If with the 1% we reject, then it looks like First Round has done a transgression,
will get retribution from Betty, <i>et al.,</i>
and needs to seek redemption and Betty,
<i>et al.,</i> are happy to have their
suspicions confirmed. Else First Round
looks like the good guys, are "certified
statistically fair to women", may get more
deal flow from women, and Betty, <i>et al.,</i>
can be happy that First Round is so nice!<p>Notice that either way Betty, <i>et al.,</i>
are "happy". That's called "happy women,
happy life"! Or, heads, the women win,
tails they lose, and in no event is there
a huge crowd of angry women in front of
First Round's offices with a bonfire of
lingerie screaming "bias"!<p>When we reject the null hypothesis, we
want to know that the reason was men
versus women and not something else, e.g.,
a <i>biased</i> sample. So here is where we
use our assumption of independence with
the same mean.<p>Now we have a <i>handle</i> on Type I error.<p>Here we have done a <i>non-parametric</i>
statistical hypothesis test, i.e., have
made no assumptions, except the means,
about the distributions of the male/female
CEO performance measurements.<p>And we can select our desired false alarm
rate in advance and get that rate almost
exactly.<p>For Type II error, that is more difficult.<p>Bottom line, what we really want is, for
whatever rate of false alarms we are
willing to tolerate, the lowest rate of
missed detections we can get.<p>Can we do that? With enough more data,
yup. There is a classic result due to J.
Neyman (long at Berkeley) and E. S. Pearson (early in statistics) that shows how.<p>How? Regard false alarm rate as money and think of investing in SF real estate. We put our money down on the opportunities with the highest expected ROI until we have spent all our money. Done. For details, an unusually general proof can follow from the Hahn decomposition from the Radon-Nikodym theorem in measure theory, e.g., Rudin, <i>Real and Complex Analysis</i>. Right, in the discrete case, we have a knapsack problem, known to be NP-complete.
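<p>Here is a tiny sketch of the money-spending idea in the discrete case. The two distributions and the 5% budget are made up for illustration, and the probabilities are in whole percent just to keep the arithmetic exact.

<pre><code>
# Neyman-Pearson in the discrete case: spend the false alarm "budget"
# on the outcomes with the highest likelihood ratio (the best "ROI") first.
# All probabilities below are in percent and made up for illustration.

alpha = 5  # false alarm budget: 5%

p_null = {'a': 90, 'b': 5, 'c': 3, 'd': 2}     # distribution under "no effect"
p_alt  = {'a': 50, 'b': 10, 'c': 15, 'd': 25}  # distribution under "effect present"

# Outcomes in order of decreasing likelihood ratio p_alt / p_null.
by_roi = sorted(p_null, key=lambda o: p_alt[o] / p_null[o], reverse=True)

reject_region, spent = [], 0
for o in by_roi:
    if spent + p_null[o] > alpha:
        break   # budget exhausted; the exact theory randomizes on this
                # boundary outcome -- the knapsack wrinkle mentioned above
    reject_region.append(o)
    spent += p_null[o]

power = sum(p_alt[o] for o in reject_region)   # 100 minus the missed detection rate, in percent
print("reject the null when we observe:", reject_region)
print("false alarm rate spent (%):", spent)
print("detection rate (%):", power)
</code></pre>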
<p>What we have done with our pot stirring is called <i>resampling</i>, and for more such look for B. Efron, long at Stanford, and P.
Diaconis, once at Harvard, now long at
Stanford.<p>Tom, with a reputation as a hacker, likes
to work late, say, till 2 AM. So, we look at the intrusion alerts each minute between 2 AM and 3 AM (something like the performance of the women) and compare with those of the other minutes of the 24 hours (like the performance of the men), much as above, and ask if Tom is trying to hack the servers.
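<p>The same pot stirring works here; a sketch with made-up alert counts, one count per minute over the 1440 minutes of a day, with the 2 AM to 3 AM hour in the role of the five women.

<pre><code>
# The pot stirring again, now on hypothetical intrusion alert counts.
import random

random.seed(2)
alerts = [random.randint(0, 3) for _ in range(1440)]  # made-up count per minute of the day

late_hour = alerts[120:180]              # the 60 minutes from 2 AM to 3 AM
other     = alerts[:120] + alerts[180:]  # the other 1380 minutes

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(late_hour) - mean(other)

pot = alerts[:]
differences = []
for _ in range(10_000):
    random.shuffle(pot)
    differences.append(mean(pot[:60]) - mean(pot[60:]))

p = sum(1 for d in differences if d >= observed) / len(differences)
print("estimated false alarm probability if we accuse Tom:", p)
</code></pre>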
<p>Or, we have a server farm and/or a network, and we want to detect problems
never seen before, e.g., <i>zero day</i>
problems. So, we have no data at all on
the problems we are trying to detect
because we have never seen any of those
before.<p>So, to do a good job, let's pick some
system we want to monitor and for that
system, get data on, say, each of 10
variables at, say, 20 times a second. Now
what?<p>Our work on bias against the women founders used just one number for our measurement and test statistic. So we were <i>uni-dimensional</i>. Here we have 10
numbers and need to be
<i>multi-dimensional.</i><p>Well, in principle we should be able to do
much better (pair of Type I and Type II
error rates) with 10 numbers than just
one. The usual ways will require us to
have, with our null hypothesis, the
probability distribution of the 10
numbers, but we could get something like that only from smoking funny stuff -- not even
<i>big data</i> is that big.<p>So, we want to need no assumptions about
distribution, that is, be
<i>distribution-free</i>.<p>So, we want a statistical hypothesis
test that is both multi-dimensional and
distribution free.<p>Can we do that? Yup.<p>"You mean you can select false alarm rate
in advance and get that rate essentially
exactly, as in PG's bias example?" Yup.<p>"Could that be used in a real server farm
or network to detect zero day problems --
security, performance, hardware/software
failures, system management errors?" Yup
-- just what it was invented for.<p>"Attempted credit card fraud?" Ah, once a
guy in an audience thought so!<p>How? Ah, sadly there is no more room in
this post!<p>What else might we do with hypothesis
tests? Well, look around at, right, <i>big
data</i> or just <i>small data</i>.<p>Do we have a case of <i>big data analytics</i>
or <i>artificial intelligence</i> (AI)?<p>Ah, I've given a sweetheart outline of
statistical hypothesis testing, and now
you are suggesting some things really low
grade? Where did I go wrong to deserve
such an insult?