=== Introduction<p>Here I give some notes for a shorter,
simpler, more general derivation of the
basics of regression analysis, one with no
calculus and no derivatives.<p>=== The Given Data<p>Let R be the set of real numbers (we could
use the set of rational numbers instead if
we wished).<p>We are given positive integers m and n.<p>In our typing, character '^' starts a
superscript; character '_' starts a
subscript.<p>We regard R^m as the set of all m x 1
matrices (column vectors); that is, each
point in R^m is a matrix with m rows and 1
column with its components in set R.<p>We are given m x 1 y in R^m.<p>We are given m x n matrix A with
components in R.<p>We will need to know what matrix
<i>transpose</i> is: For an intuitive view,
the <i>transpose</i> of A is A^T where each
column of A becomes a row of A^T. Or to
be explicit, for i = 1 to m and j = 1 to
n, the component in row i and column j of
A becomes the component in row j and
column i of A^T.<p>We notice that each column of A is m x 1
and, thus, a point in R^m.<p>Suppose b is n x 1 with components in R.<p>Then the matrix product Ab is m x 1 in R^m
and is a <i>linear combination</i> (with
<i>coefficients</i> from b) of the columns of
A.
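<p>As a quick sanity check on that <i>linear
combination</i> view, here is a small sketch
in Python with NumPy (not part of the
derivation; the sizes are made up):

  import numpy as np

  m, n = 5, 3
  A = np.random.rand(m, n)    # an m x n matrix
  b = np.random.rand(n, 1)    # an n x 1 column vector

  # Ab as a matrix product
  w1 = A @ b

  # Ab as a linear combination of the columns
  # of A, with coefficients from b
  w2 = sum(b[j, 0] * A[:, [j]] for j in range(n))

  print(np.allclose(w1, w2))  # True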
<p>=== Real World<p>So we have data A and y. To get this
data, maybe we got some medical record
data on m = 100,000 people.<p>For each person we got data on age,
income, years of schooling, height,
weight, birth gender (0 for male, 1 for
female), years of smoking, blood sugar
level, blood cholesterol level, and race
(coded 1 to 5 for white, black, Hispanic,
Asian, other), that is, n = 10 variables.
We get
to select these variables and call them
<i>independent</i>.<p>For person i = 1 to m = 100,000, we put
the value of variable j = 1 to n = 10 in
row i and column j of matrix A.<p>Or matrix A has one row for each person
and one column for each variable.<p>Vector y has systolic blood pressure: For
i = 1 to m = 100,000, row i of y, that is,
y_i, has the systolic blood pressure of
person i. We regard systolic blood
pressure in y as the <i>dependent</i> variable.
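<p>To make the layout concrete, a tiny sketch
in Python with NumPy (not part of the
derivation; the people, variables, and
numbers are made-up stand-ins):

  import numpy as np

  # Rows are people, columns are variables
  # (say age, weight, years of smoking).
  A = np.array([[34.0, 70.0,  0.0],
                [51.0, 82.0, 20.0],
                [45.0, 95.0,  5.0],
                [62.0, 77.0, 30.0]])    # m = 4, n = 3

  # y_i is the systolic blood pressure of person i.
  y = np.array([[118.0],
                [135.0],
                [141.0],
                [150.0]])               # m x 1

  # A[i, j] is the value of variable j for person i.
  print(A.shape, y.shape)               # (4, 3) (4, 1)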
<p>What we want to do is use our data to get
a mathematical function that for any
person, in the m = 100,000 or not, takes
in the values of the independent
variables, 1 x n v, and returns the
corresponding value of the dependent
variable 1 x 1 z. To this end we want n x
1 b so that we have<p>z = vb<p>So for any person, we take their
independent variable values v, apply
coefficients b, and get their dependent
variable value z.<p>We seek the b we will use on all people.
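<p>In code, that prediction step is one small
matrix product (again a sketch, not part
of the derivation; the numbers are made up
and this b is only a placeholder until we
derive the real one below):

  import numpy as np

  n = 3
  v = np.array([[45.0, 82.0, 10.0]])   # 1 x n, one person's values
  b = np.array([[1.2], [0.4], [0.9]])  # n x 1, placeholder coefficients

  z = v @ b                            # 1 x 1
  print(float(z[0, 0]))
  # The same thing written out as a sum:
  print(sum(v[0, j] * b[j, 0] for j in range(n)))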
<p>=== The <i>Normal Equations</i><p>So, we have our given m x n A and m x 1 y,
and we seek our n x 1 b.<p>We let<p>S = { Au | n x 1 u }<p>That is, S is the set of all the <i>linear
combinations</i> of the columns of A: a
vector <i>subspace</i> of R^m, a generalization
of a plane through the origin in three
dimensions or a line through the origin in
two dimensions, the subspace of R^m
<i>spanned</i> by the columns of A.<p>We seek w in S to minimize the squared
<i>length</i><p>||y - w||^2<p>= \sum_{i = 1}^m (y_i - w_i)^2<p>(notation from D. Knuth's math typesetting
software TeX), that is the sum (capital
letter Greek sigma) from i = 1 to m of<p>(y_i - w_i)^2<p>where y_i is component i of m x 1 y in R^m
and similarly for w_i.<p>That is, we seek the w in the subspace S
that is closest to our dependent variable
value y.<p>Well, from some simple geometry, the
vector<p>y - w<p>has to be perpendicular to the subspace S.
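<p>To spell that simple geometry out (an
aside, still with no calculus): let s be
any vector in S and t any number in R.
Then w + ts is also in S, and<p>||y - (w + ts)||^2 = ||y - w||^2 - 2t s^T (y - w) + t^2 ||s||^2<p>If s^T (y - w) were not 0, then taking t
small and of the same sign as s^T (y - w)
would make the right side smaller than
||y - w||^2, contradicting that w is a
closest point of S to y. So s^T (y - w) =
0 for every s in S, in particular for each
column of A.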
<p>Also from the geometry, w is unique, the
only point in S that minimizes the squared
length<p>||y - w||^2<p>For the <i>geometry</i>, there is one classic
theorem -- in W. Rudin, <i>Real and Complex
Analysis</i> -- that will mostly do: In
quite general situations, more general
than R^m, every non-empty, closed, convex
set has a unique element of minimum norm
(length). Here we apply that theorem to
the non-empty, closed, convex set<p>{ y - s | s in S }<p>whose unique element of minimum norm is
y - w; that gives us both the existence
and the uniqueness of our w.<p>Now that we have the definitions and tools
we need, we derive the <i>normal equations</i>
in just two lines of matrix algebra:<p>Since (y - w) is perpendicular to each
column of A, we have that<p>A^T (y - w) = 0<p>where the right side is an n x 1 matrix of
0s.<p>Since w is in S, we can write w = Ab for
some n x 1 b. Then<p>A^T y = A^T Ab<p>or, with the usual writing order,<p>(A^T A)b = A^T y<p>the <i>normal equations</i>, where we have A and
y and solve for b.
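<p>To see the arithmetic run, here is a
sketch in Python with NumPy (not part of
the derivation; A and y are random
stand-ins, and np.linalg.solve assumes
(A^T A) has an inverse, the case discussed
below):

  import numpy as np

  m, n = 100, 4
  A = np.random.rand(m, n)               # given m x n data
  y = np.random.rand(m, 1)               # given m x 1 data

  # Solve the normal equations (A^T A) b = A^T y.
  b = np.linalg.solve(A.T @ A, A.T @ y)  # n x 1

  w = A @ b                              # the closest point of S to y

  # A^T (y - w) should be the zero vector (up to rounding).
  print(np.allclose(A.T @ (y - w), 0.0))

  # For a new person with independent variable values v (1 x n),
  # the predicted dependent variable is z = vb (1 x 1).
  v = np.random.rand(1, n)
  z = v @ b
  print(z.shape)                         # (1, 1)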
<p>=== Results<p>Since w is in S, the equations do have a
solution. Moreover, from the geometry, w
is unique.<p>If the n x n square matrix (A^T A) has an
inverse, then the solution b is also
unique.<p>Note: Vector w DOES exist and is unique;
b DOES exist; if the inverse of (A^T A)
exists, then b is unique; otherwise b
STILL exists but is not unique. STILL w
is unique. How 'bout that!
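<p>A small illustration of that last point in
Python with NumPy (a sketch with a made-up
A and y, where a repeated column makes
(A^T A) have no inverse):

  import numpy as np

  # The third column repeats the second, so the columns
  # are dependent and (A^T A) has no inverse.
  A = np.array([[1.0, 2.0, 2.0],
                [1.0, 0.0, 0.0],
                [1.0, 1.0, 1.0],
                [1.0, 3.0, 3.0]])
  y = np.array([[1.0], [2.0], [0.0], [5.0]])

  # One least squares solution, via lstsq, which does not
  # need (A^T A) to be invertible.
  b1 = np.linalg.lstsq(A, y, rcond=None)[0]

  # Another solution: shift b1 along the null space of A.
  b2 = b1 + np.array([[0.0], [1.0], [-1.0]])

  print(np.allclose(A @ b1, A @ b2))  # True: the same w
  print(np.allclose(b1, b2))          # False: different b's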
<p>Since w is in S and since (y - w) is
perpendicular to all the vectors in S, we
have that<p>(y - w)^T w = 0<p>y^T y = (y - w + w)^T (y - w + w)<p>= (y - w)^T (y - w) + (y - w)^T w + w^T (y - w) + w^T w<p>= (y - w)^T (y - w) + w^T w<p>or<p>[total sum of squares] = [regression sum of squares] + [error sum of squares]<p>or the Pythagorean theorem.
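<p>A quick numerical check of that
decomposition (a sketch with random
stand-in data):

  import numpy as np

  m, n = 50, 3
  A = np.random.rand(m, n)
  y = np.random.rand(m, 1)

  b = np.linalg.lstsq(A, y, rcond=None)[0]
  w = A @ b

  total = (y.T @ y).item()
  error = ((y - w).T @ (y - w)).item()
  regression = (w.T @ w).item()

  # y^T y = (y - w)^T (y - w) + w^T w
  print(np.isclose(total, error + regression))  # True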
<p>So, now that we have found b, for any v,
we can get our desired z from vb.<p>The b we found works <i>best</i> in the least
squares sense for the m = 100,000 data we
had. If the 100,000 are an appropriate
<i>sample</i> of a billion, and if our b works
well on our sample, then maybe b will work
well on the billion. Ah, maybe we could
prove some theorems here, theorems with
meager or no more assumptions than we have
made so far!<p>"Look, Ma, no calculus! And no mention of
the Gaussian distribution. No mention of
maximum likelihood.<p>Except for regarding the m = 100,000
<i>observations</i> as a <i>sample</i> from a
billion, we have mentioned no probability
concepts. We never asked for a matrix
inverse. And we never mentioned
<i>multicollinearity</i>."<p>And, maybe the inverse of n x n (A^T A)
does exist but is "numerically unstable".
Sooooo, we begin to suspect: The b we get
may be inaccurate but the w may still be
fine and our<p>z = vb<p>may also still be accurate. Might want to
look into this.<p>For a start, sure, the matrix (A^T A)
being numerically unstable is essentially
the same as the ratio of the largest and
smallest eigenvalues (they are all
non-negative) being large. That is,
roughly, the problem is that the small
eigenvalues are, in the <i>scale</i> of the
problem, close to 0. Even if some of them
are exactly 0, we have shown that our w is
still unique.
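<p>As a closing sketch in Python with NumPy
(made-up, nearly collinear data), here is
that situation in miniature: the
eigenvalue ratio of (A^T A) is huge, and a
b moved a long way along the direction of
the smallest eigenvalue gives nearly the
same w, hence nearly the same fit:

  import numpy as np

  rng = np.random.default_rng(0)
  m = 200
  x = rng.normal(size=(m, 1))

  # Two nearly collinear columns, so the eigenvalues of
  # (A^T A) are badly out of scale with each other.
  A = np.hstack([x, x + 1e-6 * rng.normal(size=(m, 1))])
  y = x + 0.1 * rng.normal(size=(m, 1))

  b = np.linalg.lstsq(A, y, rcond=None)[0]
  w = A @ b

  eigvals, eigvecs = np.linalg.eigh(A.T @ A)
  print(eigvals.max() / eigvals.min())  # huge ratio

  # Move b a whole unit along the direction of the
  # smallest eigenvalue: a very different b ...
  b_other = b + eigvecs[:, [0]]
  w_other = A @ b_other

  # ... but nearly the same w, hence nearly the same fit.
  print(np.max(np.abs(b_other - b)))    # roughly 0.7
  print(np.max(np.abs(w_other - w)))    # tiny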