Mathematical ignoramus writing here, but I have a long-term project to correct my ignorance of statistics, so this seems a good place to start.<p>He isn't talking about <i>how</i> to calculate the linear regression, correct? He's talking about <i>why</i> using squared distances between data points and our line is preferred over using absolute distances. Also, I don't think he explains why absolute distances can produce multiple results? These aren't criticisms; I'm just trying to make sure I understand.<p>ISTM that you have no idea how good your regression formula (y = ax + c) is without further info. You may have random data all over the place, and yet you will still come out with one linear regression to rule them all. His house price example is a good example of this: square footage is, obviously, only one of many factors that influence price -- and also the most easily quantified factor by far. Wouldn't a standard deviation be essential info to include?<p>Also, couldn't the fact that squared distance gives us only one result actually be a negative, since it can so easily oversimplify and therefore cut out a whole chunk of meaningful information?
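To make the "how good is the fit" question concrete, here is a rough sketch of the kind of extra info I mean (made-up numbers, using numpy; not from the article): fit the line, then report the residual standard deviation and R^2 alongside the slope and intercept.<p><pre><code>
import numpy as np

# Hypothetical house data: square footage vs. price in $1000s (invented for illustration)
sqft = np.array([800, 950, 1100, 1400, 1600, 2000, 2400, 3000], dtype=float)
price = np.array([150, 180, 200, 260, 270, 340, 400, 520], dtype=float)

# Fit y = a*x + c by least squares
a, c = np.polyfit(sqft, price, deg=1)

residuals = price - (a * sqft + c)
resid_std = residuals.std(ddof=2)              # typical spread around the line (2 fitted params)
r_squared = 1 - residuals.var() / price.var()  # fraction of price variance the line explains

print(f"slope={a:.3f}, intercept={c:.1f}, residual std={resid_std:.1f}, R^2={r_squared:.2f}")
</code></pre>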
One interesting property of least squares regression is that the predictions are the conditional expectation (mean) of the target variable given the right-hand-side variables. So in the OP example, we're predicting the average price of houses of a given size.<p>The notion of predicting the mean can be extended to other properties of the conditional distribution of the target variable, such as the median or other quantiles [0]. This comes with interesting implications, such as the well-known properties of the median being more robust to outliers than the mean. In fact, the absolute loss function mentioned in the article can be shown to give a conditional median prediction (using the mid-point in case of non-uniqueness). So in the OP example, if the data set is known to contain outliers like properties that have extremely high or low value due to idiosyncratic reasons (e.g. former celebrity homes or contaminated land) then the absolute loss could be a wiser choice than least squares (of course, there are other ways to deal with this as well).<p>Worth mentioning here I think because the OP seems to be holding a particular grudge against the absolute loss function. It's not perfect, but it has its virtues and some advantages over least squares. It's a trade-off, like so many things.<p>[0] <a href="https://en.wikipedia.org/wiki/Quantile_regression" rel="nofollow">https://en.wikipedia.org/wiki/Quantile_regression</a>
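To make the robustness point tangible, here is a rough sketch (my own made-up data; assuming scikit-learn, whose QuantileRegressor with quantile=0.5 fits the conditional median, i.e. the absolute loss): add a few extreme "celebrity home" prices and compare the two fits.<p><pre><code>
import numpy as np
from sklearn.linear_model import LinearRegression, QuantileRegressor

rng = np.random.default_rng(0)
x = rng.uniform(50, 300, size=100)               # house size (made-up data)
y = 2000 * x + rng.normal(0, 20_000, size=100)   # price, true slope 2000

idx = np.argsort(x)[-5:]                         # turn the largest houses into extreme outliers
y[idx] += 2_000_000

X = x.reshape(-1, 1)
ls = LinearRegression().fit(X, y)                         # squared loss -> conditional mean
lad = QuantileRegressor(quantile=0.5, alpha=0).fit(X, y)  # absolute loss -> conditional median

print("least squares slope:", ls.coef_[0])   # pulled up by the outliers
print("absolute loss slope:", lad.coef_[0])  # stays close to the true 2000
</code></pre>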
Some important context missing from this post (IMO) is that the data set presented is probably not a very good fit for linear regression, or really most classical models: You can see that there's way more variance at one end of the dataset. So even if we find the best model for the data that looks great in our gradient-descent-like visualization, it might not have that much predictive power. One common trick to deal with data sets like this is to map the data to another space where the distribution is more even and then build a model in <i>that</i> space. Then you can make predictions for the original data set by taking the inverse mapping on the outputs of the model.
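A common concrete version of that trick is a log transform when the spread grows with the level of the data. A rough numpy sketch (invented data, not the post's): fit in log-log space, where the noise is roughly even, then invert with exp to predict in the original space.<p><pre><code>
import numpy as np

rng = np.random.default_rng(1)
sqft = rng.uniform(500, 4000, size=200)
# Multiplicative noise: the spread in price grows with size, like the fanning-out described above
price = 150 * sqft * np.exp(rng.normal(0, 0.25, size=200))

# Map both variables to log space and fit a line there
a, b = np.polyfit(np.log(sqft), np.log(price), deg=1)

# Predictions for the original data come from applying the inverse map (exp) to the model output
def predict(s):
    return np.exp(a * np.log(s) + b)

print(predict(np.array([1000.0, 2000.0, 3000.0])))
</code></pre>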
This is very light and approachable but stops short of building the statistical intuition you want here. They fixate on the smoothness of squared errors without connecting that to the Gaussian noise model, or establishing how that assumption relates to predictive power on real-world data.
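For anyone who wants that missing link spelled out (a standard result, not from the article): assume y_i = a*x_i + c + eps_i with i.i.d. Gaussian noise eps_i ~ N(0, sigma^2). Then the log-likelihood of the data is<p><pre><code>
log L(a, c, sigma) = sum_i log N(y_i | a*x_i + c, sigma^2)
                   = -(n/2) * log(2*pi*sigma^2) - (1/(2*sigma^2)) * sum_i (y_i - a*x_i - c)^2
</code></pre><p>so maximizing the likelihood over (a, c) is exactly minimizing the sum of squared errors, while e.g. Laplace-distributed noise would give the absolute loss instead.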
I really recommend this explorable explanation: <a href="https://setosa.io/ev/ordinary-least-squares-regression/" rel="nofollow">https://setosa.io/ev/ordinary-least-squares-regression/</a><p>And for actual gradient descent code, here is an older example of mine in PyTorch: <a href="https://github.com/stared/thinking-in-tensors-writing-in-pytorch/blob/master/3%20Linear%20regression.ipynb">https://github.com/stared/thinking-in-tensors-writing-in-pyt...</a>
The main practical reason squared error is minimized in ordinary linear regression is that it has an analytical solution, which makes it a bit of a weird example for gradient descent.<p>There are plenty of error formulations that give a smooth loss function, and many even a convex one, but most don't have analytical solutions, so they are solved via numerical optimization like GD.<p>The main message is IMHO correct though: squared error (and its implicit Gaussian noise assumption) is all too often used purely out of convenience and tradition.
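To show both side by side, a minimal numpy sketch (made-up data): the normal-equations solution versus plain gradient descent on the same mean-squared-error loss.<p><pre><code>
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 1.0 + rng.normal(0, 1, size=100)

# Closed form: stack a column of ones for the intercept and solve the normal equations
X = np.column_stack([x, np.ones_like(x)])
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the same mean-squared-error loss
theta = np.zeros(2)
lr = 0.01
for _ in range(5000):
    grad = 2 / len(y) * X.T @ (X @ theta - y)
    theta -= lr * grad

print(theta_exact)  # analytical solution
print(theta)        # essentially the same after enough iterations
</code></pre>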
> When using least squares, a zero derivative always marks a minimum. But that's not true in general ... To tell the difference between a minimum and a maximum, you'd need to look at the second derivative.<p>It's interesting to continue the analysis into higher dimensions, which have additional kinds of stationary points that require looking at the matrix properties of a specific type of second-order derivative (the Hessian): <a href="https://en.wikipedia.org/wiki/Saddle_point" rel="nofollow">https://en.wikipedia.org/wiki/Saddle_point</a><p>In general it's super powerful to convert data problems like linear regression into geometric considerations.
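For the two-parameter least-squares loss this check is especially simple, since the Hessian doesn't even depend on the parameters. A small numpy sketch (made-up data):<p><pre><code>
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 3.0 + rng.normal(0, 1, size=50)

X = np.column_stack([x, np.ones_like(x)])

# For L(theta) = ||X @ theta - y||^2 the Hessian is 2 * X.T @ X: constant in theta
# and positive semi-definite, so a zero gradient can only sit at a minimum.
H = 2 * X.T @ X
print(np.linalg.eigvalsh(H))  # all eigenvalues >= 0, no saddle points or maxima here
</code></pre><p>Non-quadratic losses (neural nets, for example) have parameter-dependent Hessians whose eigenvalues can change sign, which is exactly where saddle points come from.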
I intuitively think about linear regression as attaching a spring between every point and your regression line (and constraining the spring to be vertical). When the line settles, that's your regression! Also gives a physical intuition about what happens to the line when you add a point. Adding a point at the very end will "tilt" the line, while adding a point towards the middle of your distribution will shift it up or down.<p>A while ago I think I even proved to myself that this hypothetical mechanical system is mathematically equivalent to doing a linear regression, since the system naturally tries to minimize the potential energy.
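The equivalence is one line of algebra: a vertical spring from (x_i, y_i) to the line stores energy (k/2) * (y_i - (a*x_i + b))^2, so the total potential energy is k/2 times the squared-error loss and the resting position is the least-squares line. A small numpy sketch that "relaxes" the springs (overdamped dynamics, i.e. gradient descent on the energy; made-up data) and compares with np.polyfit:<p><pre><code>
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=30)
y = 1.5 * x + 2.0 + rng.normal(0, 1, size=30)

k = 1.0            # spring constant (drops out of the resting position)
a, b = 0.0, 0.0
step = 0.001
for _ in range(50_000):
    stretch = y - (a * x + b)          # vertical extension of each spring
    # Move the line along the net force, i.e. minus the gradient of the total energy
    a += step * k * np.sum(stretch * x)
    b += step * k * np.sum(stretch)

print(a, b)
print(np.polyfit(x, y, deg=1))  # same line, up to numerical error
</code></pre>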
I built a small static web app [0] (with svelte and tensorflow js) that shows gradient descent. It has two kinds of problems: wave (the default) and linear. In the first case, the algorithm learns y = cos(ax + b); in the second, y = ax + b.
The training data is generated from these functions with some noise.<p>I spent some time making it work with interpolation so that the transitions are smooth.<p>Then I expanded to another version, including a small neural network (nn) [1].<p>And finally, for the two functions that have a 2d parameter space, I included a viz of the loss [2]. You can click on the 2d space and get a new initial point for the descent, and see the trajectory.<p>Never really finished it, though I wrote a blog post about it [3]<p>[0] <a href="https://gradfront.pages.dev/" rel="nofollow">https://gradfront.pages.dev/</a><p>[1] <a href="https://f36dfeb7.gradfront.pages.dev/" rel="nofollow">https://f36dfeb7.gradfront.pages.dev/</a><p>[2] <a href="https://deploy-preview-1--gradient-descent.netlify.app/" rel="nofollow">https://deploy-preview-1--gradient-descent.netlify.app/</a><p>[3] <a href="https://blog.horaceg.xyz/posts/need-for-speed/" rel="nofollow">https://blog.horaceg.xyz/posts/need-for-speed/</a>
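For anyone who wants a non-interactive taste of the wave case, here is a rough numpy sketch of the same idea (my own made-up numbers, not the app's code): sample noisy points from y = cos(ax + b) and recover a and b by gradient descent.<p><pre><code>
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 3, size=200)
true_a, true_b = 1.5, 0.5
y = np.cos(true_a * x + true_b) + rng.normal(0, 0.1, size=200)

# Unlike the linear case this loss is non-convex, so the starting point matters
a, b = 1.3, 0.3
lr = 0.1
for _ in range(5000):
    resid = np.cos(a * x + b) - y
    # Chain rule: d/da cos(a*x+b) = -sin(a*x+b)*x, d/db cos(a*x+b) = -sin(a*x+b)
    grad_a = np.mean(2 * resid * -np.sin(a * x + b) * x)
    grad_b = np.mean(2 * resid * -np.sin(a * x + b))
    a -= lr * grad_a
    b -= lr * grad_b

print(a, b)  # should land near (1.5, 0.5)
</code></pre><p>That sensitivity to the starting point is also what the loss-surface viz [2] lets you play with directly.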
I don't have anything useful to say, but, how the hell is that a "12 min read"?<p>I always find those counters to greatly overestimate reading speed, but for a technical article like this it's outright insulting, to be honest.
Nice, thanks for sharing! I shared this with my HS calculus teacher :) (My model is that his students should be motivated to get machine learning engineering jobs, so they should be motivated to learn calculus, but who knows.)
In the same vein, Karpathy's video series "Neural Networks from zero to hero" [0] touches on a lot of this and the underlying intuitions as well. It's one of the best introductory series (even if you ignore the neural net part of it) and covers gradients, differentiation, and what they mean intuitively.<p>[0] <a href="https://youtu.be/VMj-3S1tku0?si=jq1cCSn5si17KK1o" rel="nofollow">https://youtu.be/VMj-3S1tku0?si=jq1cCSn5si17KK1o</a>
Another way to approach the explanation is by understanding the data generating process, i.e. the statistical assumptions about the process that generates the data. That can go a long way to understanding _analytically_ whether a linear regression model is a good fit (or what to change in it to make it work). And, arguably more importantly, it is also a reason why we frame linear regression as a statistical problem instead of an optimization one (or an analytical OLS) in the first place. I would argue understanding it from a statistical standpoint provides much better intuition for a practitioner.<p>The reason to look at statistical assumptions is that we want to make probabilistic/statistical statements about the response variable, like what its central tendency is and how much it varies as values of X change. The response variable is not easy to measure.<p>Now, one can easily determine, for example using OLS (or gradient descent), the point estimates for the parameters of a line fit to two variables X and Y, without using any probability or statistical theory. OLS is, in point of fact, just an analytical result and has nothing to do with the theory of statistics or inference. The assumptions of simple linear regression are statistical assumptions which can be right or wrong, but if they hold, they help us make inferences like:<p><pre><code> - Is the response variable varying uniformly over values of another r.v., X (the predictors)?
 - Assuming an r.v. Y, what model can we make if its expectation is a linear function of X?
</code></pre>
So why do we make statistical assumptions instead of just point estimates?
Because measurements are never certain, and making those assumptions is one way of quantifying that uncertainty. Indeed, going through history one finds that regression's use outside experimental data (Galton, 1885) came well after least squares (Gauss, 1795-1809). The fundamental reason, to <i>understand</i> natural variation in data, was the original motivation. In Galton's case he wanted to study hereditary traits like wealth over generations, as well as others like height, status, and intelligence (coincidentally, that is also what makes the assumptions of linear regression a good tool for studying this: it's the idea of regression to the mean; very wealthy or very poor families don't remain so over a family's generations, they regress towards the mean, and the same goes for societal class and intelligence over generations).<p>When you follow this arc of reasoning, you come to the following _statistical_ conditions the data must satisfy for the linear assumptions to work(ish):<p>Linear mean function of the response variable conditioned on a value of X:<p>E[Y|X=x] = \beta_0 + \beta_1*x<p>Constant variance of the response variable conditioned on a value of X:<p>Var[Y|X=x] = \sigma^2 (or actually just finite also works well)
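To make those two conditions concrete, a small numpy simulation (invented numbers): generate data that satisfies them exactly, fit by least squares, and check that the residual spread is roughly constant across slices of X.<p><pre><code>
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(0, 10, size=n)

beta0, beta1, sigma = 2.0, 0.7, 1.0
# E[Y|X=x] = beta0 + beta1*x,  Var[Y|X=x] = sigma^2 (constant in x)
y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)

b1, b0 = np.polyfit(x, y, deg=1)
resid = y - (b0 + b1 * x)

# Under the assumptions, residual spread should look about the same in every slice of x
for lo in range(0, 10, 2):
    mask = (x >= lo) & (x < lo + 2)
    print(f"x in [{lo},{lo+2}): residual std = {resid[mask].std():.2f}")
</code></pre>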
See another interactive article explaining linear regression and gradient descent: <a href="https://mlu-explain.github.io/linear-regression/" rel="nofollow">https://mlu-explain.github.io/linear-regression/</a>
All that's wrong with the modern world:<p><a href="https://www.ibm.com/think/topics/linear-regression" rel="nofollow">https://www.ibm.com/think/topics/linear-regression</a><p>A proven way to scientifically and reliably predict the future<p>Business and organizational leaders can make better decisions by using linear regression techniques. Organizations collect masses of data, and linear regression helps them use that data to better manage reality, instead of relying on experience and intuition. You can take large amounts of raw data and transform it into actionable information.<p>You can also use linear regression to provide better insights by uncovering patterns and relationships that your business colleagues might have previously seen and thought they already understood.<p>For example, performing an analysis of sales and purchase data can help you uncover specific purchasing patterns on particular days or at certain times. Insights gathered from regression analysis can help business leaders anticipate times when their company’s products will be in high demand.
The amount of em dashes in this makes it look very AI-written. That doesn't make it a bad piece, but it does make me check every sentence more carefully for errors.