I asked that early in my career.<p>We want a metric essentially because if we converge or have a good approximation in the metric, then we are close in some important respects.<p>Squared error, then, gives one such metric.<p>But for some given data, usually there are several metrics we might use, e.g., absolute error (L^1), worst case error (L^infinity), L^p for positive integer p, etc.
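<p>To make those choices concrete, here is a small numerical sketch (Python with NumPy; the data and the candidate approximation are made up just for illustration) of the same error measured in several of those metrics:

    import numpy as np

    # toy data and one candidate approximation -- numbers are arbitrary
    y    = np.array([1.0, 2.0, 0.5, 4.0, 3.0])
    yhat = np.array([1.1, 1.7, 0.7, 3.5, 3.2])
    err  = y - yhat

    print("L^1 (total absolute error) :", np.sum(np.abs(err)))
    print("L^2 (root of squared error):", np.sqrt(np.sum(err ** 2)))
    print("L^infinity (worst case)    :", np.max(np.abs(err)))
    print("L^3                        :", np.sum(np.abs(err) ** 3) ** (1.0 / 3.0))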
From 50,000 feet up, the reason for using squared error is that we get to have the Pythagorean theorem and, more generally, get to work in a Hilbert space, a relatively nice place to be; e.g., we also get to work with angles from inner products, correlations, and covariances -- we get cosines and a version of the law of cosines. E.g., we get to do orthogonal projections, which give us minimum squared error.
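<p>A quick sketch of those claims in code (Python/NumPy; the vectors and coefficients are invented for the example): correlation as a cosine of an angle, orthogonal projection as the least squares fit, and the Pythagorean split of squared lengths:

    import numpy as np

    rng = np.random.default_rng(0)

    # correlation is the cosine of the angle between the centered vectors
    u = rng.normal(size=500)
    v = 0.6 * u + rng.normal(size=500)
    uc, vc = u - u.mean(), v - v.mean()
    cosine = uc @ vc / (np.linalg.norm(uc) * np.linalg.norm(vc))
    print(cosine, np.corrcoef(u, v)[0, 1])        # the two numbers agree

    # orthogonal projection of y onto the column span of X = least squares fit
    X = rng.normal(size=(500, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=500)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fit = X @ beta
    resid = y - fit
    print(np.abs(X.T @ resid).max())              # residual is orthogonal to the columns, ~0

    # Pythagorean theorem: ||y||^2 = ||fit||^2 + ||resid||^2
    print(y @ y, fit @ fit + resid @ resid)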
With Hilbert space, commonly we can write the total error as a sum of contributions from orthogonal components, that is, decompose the error into contributions from those components -- nice.
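<p>For example (a Python/NumPy sketch, with a made-up design matrix): orthonormalize some regressors, and the squared length of y splits exactly into one piece per orthogonal component plus a residual piece:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(400, 4))
    y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=400)

    # orthonormalize the columns; each column of Q is one orthogonal component
    Q, _ = np.linalg.qr(X)
    coef = Q.T @ y                   # coordinate of y along each component
    resid = y - Q @ coef

    print(y @ y)                                   # total squared length
    print(np.sum(coef ** 2) + resid @ resid)       # same number, split into pieces
    print(coef ** 2)                               # contribution of each orthogonal component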
The Hilbert space we get from squared error gives us the nicest version of Fourier theory, that is, orthogonal representation and decomposition and best squared error approximation.
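<p>A small sketch of that last point (Python/NumPy; a square wave is just the test function): among approximations built from a fixed set of harmonics, the partial Fourier sum, i.e., the one using the actual Fourier coefficients, has the smallest squared error; nudging any coefficient away from its Fourier value makes the error larger:

    import numpy as np

    # one period of a square wave and its discrete Fourier coefficients
    n = 1024
    t = np.arange(n) / n
    f = np.sign(np.sin(2 * np.pi * t))
    F = np.fft.rfft(f) / n

    def partial_sum(coeffs, keep):
        c = coeffs.copy()
        c[keep:] = 0.0               # keep only the lowest `keep` harmonics
        return np.fft.irfft(c * n, n)

    best = partial_sum(F, 8)

    # same harmonics, but with one coefficient nudged off its Fourier value
    nudged = F.copy()
    nudged[3] *= 1.2
    worse = partial_sum(nudged, 8)

    print(np.mean((f - best) ** 2), np.mean((f - worse) ** 2))   # best < worse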
We also like Fourier theory with squared error because of how it gives us the Heisenberg uncertainty principle.
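<p>That shows up numerically, too. A sketch (Python/NumPy; Gaussian pulses, the case where the bound is tight): the standard deviation of |f|^2 in time, times the standard deviation of |F|^2 in angular frequency, stays near the lower bound 1/2 no matter what width you pick:

    import numpy as np

    # Gaussian pulses: time spread times (angular) frequency spread is about 1/2
    t = np.linspace(-40.0, 40.0, 1 << 14)
    dt = t[1] - t[0]

    def spreads(sigma_t):
        f = np.exp(-t ** 2 / (4.0 * sigma_t ** 2))   # |f|^2 has standard deviation sigma_t
        wt = np.abs(f) ** 2
        sd_t = np.sqrt(np.sum(wt * t ** 2) / np.sum(wt))
        F = np.fft.fft(f)
        omega = 2.0 * np.pi * np.fft.fftfreq(t.size, d=dt)
        ww = np.abs(F) ** 2
        sd_w = np.sqrt(np.sum(ww * omega ** 2) / np.sum(ww))
        return sd_t, sd_w

    for s in (0.5, 1.0, 2.0):
        sd_t, sd_w = spreads(s)
        print(sd_t, sd_w, sd_t * sd_w)   # the product stays near 0.5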
Under meager assumptions, for real valued random variables X and Y, E[Y|X], a function of X, is the best squared error approximation of Y by a function of X.
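<p>A Monte Carlo sketch of that claim (Python/NumPy; the model Y = X^2 + noise is just an example chosen so that E[Y|X] = X^2): the conditional expectation beats other functions of X in mean squared error:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000
    x = rng.normal(size=n)
    y = x ** 2 + rng.normal(scale=0.5, size=n)    # so E[Y|X] = X^2

    def mse(pred):
        return np.mean((y - pred) ** 2)

    print("E[Y|X] = X^2   :", mse(x ** 2))                    # about 0.25, the noise variance
    print("best linear fit:", mse(np.polyval(np.polyfit(x, y, 1), x)))
    print("constant E[Y]  :", mse(np.full(n, y.mean())))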
Squared error gives us variance, and in statistics the sample mean and sample variance are <i>sufficient statistics</i> for the Gaussian; that is, for Gaussian data we can take the sample mean and sample variance, throw away the rest of the data, and do just as well (sketch below).<p>For more, convergence in squared error can imply convergence almost surely, at least for a subsequence.<p>Then there is the Hilbert space result that every nonempty, closed, convex subset has a unique element of minimum norm (from squared error) -- nice.
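<p>Here is the promised sketch of the sufficiency point (Python/NumPy, made-up data): the Gaussian log likelihood computed from the raw data matches the one computed from just the sample size, sample mean, and sample variance, for any (mu, sigma^2) you try:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(loc=3.0, scale=2.0, size=1000)

    def loglik_full(x, mu, sigma2):
        # Gaussian log likelihood straight from the raw data
        return -0.5 * x.size * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

    def loglik_suff(n, xbar, s2, mu, sigma2):
        # the same log likelihood from (n, sample mean, sample variance) only,
        # using sum((x - mu)^2) = n*s2 + n*(xbar - mu)^2, with s2 the divide-by-n variance
        return -0.5 * n * np.log(2 * np.pi * sigma2) - (n * s2 + n * (xbar - mu) ** 2) / (2 * sigma2)

    n, xbar, s2 = x.size, x.mean(), x.var()       # np.var is the divide-by-n version by default
    for mu, sigma2 in [(0.0, 1.0), (3.0, 4.0), (2.5, 5.0)]:
        print(loglik_full(x, mu, sigma2), loglik_suff(n, xbar, s2, mu, sigma2))   # each pair matches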