
Why squared error? (2014)

256 points, posted by rpbertp13, over 8 years ago

23 comments

eanzenberg, over 8 years ago
Why squared error? Because you can minimize the squared error in closed form using linear algebra.

Why L2 regularization? Same reason: a closed-form solution exists from linear algebra.

But at the end of the day, you are most interested in the expectation value of the coefficients, and minimizing the squared error gives you E[coeffs], the mean of the coefficients.
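A minimal numpy sketch (not part of the comment) of the closed-form solutions being alluded to: the normal equations for ordinary least squares, and the ridge (L2-regularized) variant obtained by adding a lambda * I term. The data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                 # design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Ordinary least squares: minimize ||y - Xw||^2  ->  w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: minimize ||y - Xw||^2 + lam * ||w||^2  ->  w = (X^T X + lam I)^{-1} X^T y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(w_ols)    # close to [1, -2, 0.5]
print(w_ridge)  # same coefficients, shrunk slightly toward zero
```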
throw_away_777, over 8 years ago
There is a Kaggle competition right now that uses mean absolute error, and this makes the problem substantially harder. For a practical discussion of techniques used to solve machine-learning problems that use MAE, see the forums at https://www.kaggle.com/c/allstate-claims-severity/forums

As touched upon in the article, the objective not being differentiable is a big deal for modern machine-learning methods.
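To illustrate the differentiability point, here is a small numpy sketch (mine, not from the thread): the MSE gradient scales smoothly with the residual, the MAE gradient is a constant-magnitude sign function with a kink at zero, and the Huber loss is one common differentiable compromise used when the metric of interest is MAE.

```python
import numpy as np

def mse_grad(residual):
    # d/dr of r^2 is 2r: smooth everywhere, magnitude grows with the error
    return 2.0 * residual

def mae_grad(residual):
    # d/dr of |r| is sign(r): constant magnitude, undefined (kink) at r = 0
    return np.sign(residual)

def huber_grad(residual, delta=1.0):
    # Huber loss: quadratic near zero, linear in the tails -- a common
    # differentiable stand-in when the metric of interest is MAE
    return np.where(np.abs(residual) <= delta, residual, delta * np.sign(residual))

r = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(mse_grad(r))    # [-6.  -1.   0.   1.   6. ]
print(mae_grad(r))    # [-1.  -1.   0.   1.   1. ]
print(huber_grad(r))  # [-1.  -0.5  0.   0.5  1. ]
```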
gpsx, over 8 years ago
For minimizing the square of the errors, I think the good reason is that, assuming your data has a Gaussian probability distribution, minimizing the squared error corresponds to maximizing the likelihood of the measurement, as you and others have said.

Why do we assume Gaussian errors? There is seldom a Gaussian distribution in the real world, usually because the probability of large error values doesn't decay that fast. We use it because the math is easy and we can actually solve the problem under that assumption.
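A sketch of the step being appealed to, written out: with i.i.d. Gaussian errors, maximizing the likelihood is the same as minimizing the sum of squared residuals.

```latex
% Sketch: with y_i = f_\theta(x_i) + \varepsilon_i and
% \varepsilon_i \sim \mathcal{N}(0, \sigma^2) i.i.d., the log-likelihood is
\log L(\theta)
  = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \bigl(y_i - f_\theta(x_i)\bigr)^2 ,
% so maximizing L over \theta is exactly minimizing \sum_i (y_i - f_\theta(x_i))^2.
```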
tvural, over 8 years ago
The best explanation is probably that squared error gives you the best fit when you assume your errors should be normally distributed.

Things like the fact that squared error is differentiable are actually irrelevant: if the best model is not differentiable, you should still use it.
graycat, over 8 years ago
I asked that early in my career.

We want a metric essentially because, if we converge or have a good approximation in the metric, then we are close in some important respects.

Squared error, then, gives one such metric.

But for some given data, usually there are several metrics we might use, e.g., absolute error (L^1), worst-case error (L^infinity), L^p for positive integer p, etc.

From 50,000 feet up, the reason for using squared error is that we get to have the Pythagorean theorem and, more generally, get to work in a Hilbert space, a relatively nice place to be; e.g., we also get to work with angles from inner products, correlations, and covariances -- we get cosines and a version of the law of cosines. E.g., we get to do orthogonal projections, which give us minimum squared error.

With Hilbert space, commonly we can write the total error as a sum of contributions from orthogonal components, that is, decompose the error into contributions from those components -- nice.

The Hilbert space we get from squared error gives us the nicest version of Fourier theory, that is, orthogonal representation and decomposition, and best squared-error approximation.

We also like Fourier theory with squared error because of how it gives us the Heisenberg uncertainty principle.

Under meager assumptions, for real-valued random variables X and Y, E[Y|X], a function of X, is the best squared-error approximation of Y by a function of X.

Squared error gives us variance, and in statistics the sample mean and variance are *sufficient statistics* for the Gaussian; that is, for Gaussian data, we can take the sample mean and sample variance, throw away the rest of the data, and do just as well.

For more, convergence in squared error can imply convergence almost surely, at least for a subsequence.

Then there is the Hilbert space result: every nonempty, closed, convex subset has a unique element of minimum norm (from squared error) -- nice.
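One of the Hilbert-space facts mentioned above, written out as a sketch: E[Y|X] is the orthogonal projection of Y onto (square-integrable) functions of X, so the squared error decomposes in a Pythagorean way.

```latex
% Sketch: for any candidate function g of X,
\mathbb{E}\!\left[(Y - g(X))^2\right]
  = \mathbb{E}\!\left[(Y - \mathbb{E}[Y \mid X])^2\right]
  + \mathbb{E}\!\left[(\mathbb{E}[Y \mid X] - g(X))^2\right],
% a Pythagorean identity: the cross term vanishes because Y - E[Y|X] is
% orthogonal to every function of X. Hence g(X) = E[Y|X] minimizes the
% mean squared error.
```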
shawnz, over 8 years ago
I am no math expert, but I have always thought about it like this. The squared error is like weighting the error by the error. This causes one big error to be more significant than many small errors, which is usually what you want. Am I on the right track?
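A tiny numeric illustration of that intuition (my own example, not from the comment): under absolute error, one miss of 10 and ten misses of 1 cost the same; under squared error, the single big miss costs ten times more.

```python
# One error of size 10 versus ten errors of size 1:
big   = [10.0]
small = [1.0] * 10

sum_abs = sum(abs(e) for e in big), sum(abs(e) for e in small)
sum_sq  = sum(e**2 for e in big),   sum(e**2 for e in small)

print(sum_abs)  # (10.0, 10.0)   -- absolute error treats them the same
print(sum_sq)   # (100.0, 10.0)  -- squared error penalizes the single big miss 10x more
```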
kazinator, over 8 years ago
Squared error represents the underlying belief that errors in various dimensions, or errors in independent samples, are linearly independent. So they add together like orthogonal vectors, forming a vector whose length is the square root of the sum of the squares. Minimizing the square error is a way of minimizing that square root without the superfluous operation of calculating it.
thomasahle, over 8 years ago
It's fine to list some reasons for using squared error, but you really can't decide on the error function without referring to a problem you're trying to solve.

Just look at the success of compressed sensing, based on taking the absolute-value error seriously.
dnautics, over 8 years ago
"Inner products / Gaussians" - the absolute value (and also the cube root of absolute cubes, the fourth root of fourth powers) also defines a norm. Likewise, there are "Gaussian-like" formulas which take these powers instead of the square.

However: if you look at the shape of the square root of the sum of squares, it's a circle, so you can rotate it. If you take the absolute value, it's a square, so it cannot be rotated; the cube root of cubes, the fourth root of fourths, etc. look like rounded-edge squares, and those cannot be rotated either, so if you have a change of vector basis, you're out of luck.

With the Gaussian forms of other powers, none of them have the central-limit property.
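A small numpy check of the rotation argument (my own sketch): the L2 norm of a vector is unchanged by a rotation, while the L1 norm is not.

```python
import numpy as np

# Rotate a vector by 45 degrees and compare norms before and after.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

v = np.array([1.0, 0.0])
w = R @ v  # roughly [0.707, 0.707]

print(np.linalg.norm(v, 2), np.linalg.norm(w, 2))  # 1.0, 1.0    -- L2 is rotation-invariant
print(np.linalg.norm(v, 1), np.linalg.norm(w, 1))  # 1.0, ~1.414 -- L1 changes under rotation
```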
j7ake, over 8 years ago
The Bayesian formulation for the likelihood function would make this squared error explicitly clear.
TeMPOraL, over 8 years ago
My explanation for squared error in linear approximation has always been: because it minimizes the thickness of the line that passes through all the data points.

(Per the old math joke - you can make a line pass through any three points on a plane if you make it thick enough.)
theophrastus, over 8 years ago
Or why use variances when there are standard deviations (the square root of the variance), which have more easily interpreted units? One commonly cited reason is that one can sum variances from different factors, which one cannot do with standard deviations. There are other properties of variances which make them more suitable for continued calculations [1]. This is why, for instance, variances are often utilized in automated optimization packages.

[1] https://en.wikipedia.org/wiki/Variance#Properties
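The additivity being cited, as a one-line sketch: for independent variables the variances add, while the standard deviations do not.

```latex
% For independent X and Y:
\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y),
\qquad
\sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2} \neq \sigma_X + \sigma_Y
\ \text{in general.}
```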
bagrow, over 8 years ago
Interesting discussion. Not sure about the breakdown between ridge regression and LASSO though. The difference is not in the error term but in the regularization term.
thisrod, over 8 years ago
Squared error, because the uncertainties in independent, normally distributed random variables add in quadrature. I expect that this could be proved geometrically using Pythagoras's theorem, so in that sense the comments about orthogonal axes are vaguely on the right track.

Normally distributed variables, because of the central limit theorem.

It isn't all that complicated.
jostmey, over 8 years ago
Why not KL divergence, which measures the error between a target distribution and the current distribution? From the perspective of information theory, it is the best error measurement.

Oh, and let's not forget that for a lot of problems, minimizing the KL divergence is the exact same operation as maximizing the likelihood function.
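A short sketch of the equivalence mentioned in the second paragraph: with the empirical distribution p-hat and model p_theta, the KL divergence differs from the negative average log-likelihood only by a theta-independent constant.

```latex
% Sketch:
D_{\mathrm{KL}}(\hat{p}\,\|\,p_\theta)
  = \mathbb{E}_{x \sim \hat{p}}[\log \hat{p}(x)]
  - \mathbb{E}_{x \sim \hat{p}}[\log p_\theta(x)],
% the first term does not depend on \theta, so
\arg\min_\theta D_{\mathrm{KL}}(\hat{p}\,\|\,p_\theta)
  = \arg\max_\theta \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i).
```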
highd, over 8 years ago
Another pro tip - the absolute error magnitude (the l_1 norm) is the convex hull of the non-zero entry count for vectors (the l_0 "norm" in some circles). So in the convex-minimization context (and for most other smooth loss terms in general), you end up with solutions with more zero entries and a few possibly large non-zero entries.
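A small numpy sketch (not from the comment) of why the L1 penalty yields exact zeros while the squared penalty only shrinks, comparing the two penalties' proximal operators: soft-thresholding versus uniform shrinkage.

```python
import numpy as np

def prox_l1(x, lam):
    # Soft-thresholding: the proximal operator of lam * |x| (the L1 penalty).
    # Anything with |x| <= lam is set exactly to zero.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l2(x, lam):
    # Shrinkage: the proximal operator of (lam/2) * x^2 (the squared penalty).
    # Values are scaled down but never reach exactly zero.
    return x / (1.0 + lam)

x = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
print(prox_l1(x, 0.5))  # [-1.5  -0.    0.    0.3   2.5 ]   -- small entries zeroed out
print(prox_l2(x, 0.5))  # [-1.333 -0.2  0.067 0.533 2.0 ]   -- everything merely shrunk
```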
adamzerner, over 8 years ago
Also see http://www.leeds.ac.uk/educol/documents/00003759.htm
redcalx, over 8 years ago
Somewhat related; here's my attempt at explaining cross entropy:

http://heliosphan.org/cross-entropy.html
heisenbit, over 8 years ago
Square often corresponds to power in systems.
fiatjaf, over 8 years ago
"Why geometric mean?", I would ask.
jayajay, over 8 years ago
Because linear algebra is a beautiful framework to think in.
dschiptsov, over 8 years ago
To make it positive, and to amplify it (as a side effect).

BTW, "error" is a misleading term - it communicates some fault, at least in the common sense. Distance would be a much better term.

So "squared distance" makes much more sense, because negative distance is nonsense.
bitL, over 8 years ago
An honest question - do we even need statistics when we have machine learning? Statistics to me appears as a hack/aggregation of data we couldn't process at once in the past; these days ML + Big Data can achieve that, and instead of statistics we can do computational inference. To me this looks like looking back to the "old ways" for a reference point instead of looking forward to the unknown but more exciting.