Some important context missing from this post (IMO) is that the data set presented is probably not a very good fit for linear regression, or really most classical models: you can see that there's far more variance at one end of the data set than at the other (the spread isn't constant, i.e. heteroscedasticity). So even if we find the best model for the data, one that looks great in the gradient-descent-style visualization, it might not have much predictive power. One common trick for data sets like this is to map the data into another space where the spread is more even and then build a model in <i>that</i> space. You can then make predictions for the original data set by applying the inverse mapping to the model's outputs.
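To make that transform trick concrete, here's a rough NumPy sketch (entirely my own toy example; the log-log mapping is just one possible choice, and the right transform depends on the data):

    import numpy as np

    # Toy data where the spread grows with x, roughly the situation described above.
    rng = np.random.default_rng(0)
    x = np.linspace(1.0, 10.0, 200)
    y = 2.0 * x + rng.normal(scale=0.3 * x)        # noise scale grows with x

    # Map to log-log space, where the spread is more even (this assumes x, y > 0),
    # and fit an ordinary least-squares line there.
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)

    # Predictions for the original data come from the inverse mapping (exp).
    y_pred = np.exp(intercept) * x ** slope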
One interesting property of least squares regression is that the predictions are the conditional expectation (mean) of the target variable given the right-hand-side variables. So in the OP example, we're predicting the average price of houses of a given size.<p>The notion of predicting the mean can be extended to other properties of the conditional distribution of the target variable, such as the median or other quantiles [0]. This comes with interesting implications, such as the well-known properties of the median being more robust to outliers than the mean. In fact, the absolute loss function mentioned in the article can be shown to give a conditional median prediction (using the mid-point in case of non-uniqueness). So in the OP example, if the data set is known to contain outliers like properties that have extremely high or low value due to idiosyncratic reasons (e.g. former celebrity homes or contaminated land) then the absolute loss could be a wiser choice than least squares (of course, there are other ways to deal with this as well).<p>Worth mentioning here I think because the OP seems to be holding a particular grudge against the absolute loss function. It's not perfect, but it has its virtues and some advantages over least squares. It's a trade-off, like so many things.<p>[0] <a href="https://en.wikipedia.org/wiki/Quantile_regression" rel="nofollow">https://en.wikipedia.org/wiki/Quantile_regression</a>
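A tiny numerical illustration of that mean-vs-median point (toy numbers I made up, not from the article): for an intercept-only model, the squared-loss minimizer sits at the mean and the absolute-loss minimizer at the median, so a single outlier drags the former far more than the latter.

    import numpy as np

    # Hypothetical house prices (in $1000s) with one extreme outlier.
    prices = np.array([250.0, 255.0, 258.0, 260.0, 265.0, 270.0, 2500.0])

    # Evaluate both losses over a grid of constant predictions.
    candidates = np.linspace(prices.min(), prices.max(), 100001)
    sq_loss = ((prices[:, None] - candidates) ** 2).sum(axis=0)
    abs_loss = np.abs(prices[:, None] - candidates).sum(axis=0)

    print(candidates[sq_loss.argmin()], prices.mean())       # both ~580, pulled up by the outlier
    print(candidates[abs_loss.argmin()], np.median(prices))  # both ~260, barely affected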
This is very light and approachable, but it stops short of building the statistical intuition you want here. The author fixates on the smoothness of squared errors without connecting it to the Gaussian noise model or establishing how that relates to predictive power on the kinds of data you see in practice.
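For readers who want that missing step spelled out, here is the standard connection (my sketch, not something the article derives): with i.i.d. Gaussian noise, maximizing the likelihood is the same as minimizing the sum of squared errors, and the loss is smooth because the Gaussian log-density is quadratic.

    y_i = f(x_i; \theta) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

    -\log L(\theta) = \frac{1}{2\sigma^2} \sum_i \bigl(y_i - f(x_i; \theta)\bigr)^2 + \frac{n}{2} \log(2\pi\sigma^2)

so \arg\max_\theta L(\theta) = \arg\min_\theta \sum_i (y_i - f(x_i; \theta))^2.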
I really recommend this explorable explanation: <a href="https://setosa.io/ev/ordinary-least-squares-regression/" rel="nofollow">https://setosa.io/ev/ordinary-least-squares-regression/</a><p>And for actual gradient descent code, here is an older example of mine in PyTorch: <a href="https://github.com/stared/thinking-in-tensors-writing-in-pytorch/blob/master/3%20Linear%20regression.ipynb">https://github.com/stared/thinking-in-tensors-writing-in-pyt...</a>
The main practical reason squared error is minimized in ordinary linear regression is that it has an analytical solution, which makes it a somewhat odd example for gradient descent.<p>There are plenty of error formulations that give a smooth loss function, and many even a convex one, but most don't have analytical solutions, so they are solved via numerical optimization like GD.<p>The main message is IMHO correct though: squared error (and its implicit Gaussian noise assumption) is all too often used purely out of convenience and tradition.
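To make the comparison concrete, here's a NumPy sketch on made-up data (nothing from the article): the normal equations give the exact least-squares fit directly, while gradient descent needs many iterations to reach essentially the same answer.

    import numpy as np

    # Toy data: y roughly linear in x.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)
    y = 3.0 * x + 1.0 + rng.normal(scale=1.0, size=100)
    X = np.column_stack([x, np.ones_like(x)])   # design matrix with an intercept column

    # Analytical solution via the normal equations: (X^T X) theta = X^T y.
    theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

    # The same fit via gradient descent on the mean squared error.
    theta = np.zeros(2)
    lr = 0.001
    for _ in range(20000):
        grad = 2 * X.T @ (X @ theta - y) / len(y)
        theta -= lr * grad

    print(theta_exact, theta)   # both close to [3, 1]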
I built a small static web app [0] (with Svelte and TensorFlow.js) that shows gradient descent. It has two kinds of problems: wave (the default) and linear. In the first case, the algorithm learns y = cos(ax + b); in the second, y = ax + b (there's a rough sketch of the wave case after the links below).
The training data is generated from these functions with some noise.<p>I spent some time making it work with interpolation so that the transitions are smooth.<p>Then I expanded it into another version that includes a small neural network (NN) [1].<p>And finally, for the two functions that have a 2D parameter space, I included a visualization of the loss [2]. You can click on the 2D space to get a new initial point for the descent and see the trajectory.<p>I never really finished it, though I did write a blog post about it [3].<p>[0] <a href="https://gradfront.pages.dev/" rel="nofollow">https://gradfront.pages.dev/</a><p>[1] <a href="https://f36dfeb7.gradfront.pages.dev/" rel="nofollow">https://f36dfeb7.gradfront.pages.dev/</a><p>[2] <a href="https://deploy-preview-1--gradient-descent.netlify.app/" rel="nofollow">https://deploy-preview-1--gradient-descent.netlify.app/</a><p>[3] <a href="https://blog.horaceg.xyz/posts/need-for-speed/" rel="nofollow">https://blog.horaceg.xyz/posts/need-for-speed/</a>
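Not the app's actual code (that uses TensorFlow.js), but here is a plain-NumPy sketch of the wave case, fitting a and b in y = cos(ax + b) with gradient descent:

    import numpy as np

    # Noisy training data from the "wave" function.
    rng = np.random.default_rng(0)
    a_true, b_true = 2.0, 0.5
    x = rng.uniform(-3, 3, size=200)
    y = np.cos(a_true * x + b_true) + rng.normal(scale=0.1, size=200)

    # Gradient descent on the mean squared error. The loss is non-convex in (a, b),
    # so the starting point matters; that's what clicking around the 2D loss viz shows.
    a, b = 1.9, 0.4
    lr = 0.1
    for _ in range(2000):
        r = np.cos(a * x + b) - y
        grad_a = np.mean(2 * r * -np.sin(a * x + b) * x)
        grad_b = np.mean(2 * r * -np.sin(a * x + b))
        a, b = a - lr * grad_a, b - lr * grad_b

    print(a, b)   # lands near (2.0, 0.5) from this close-enough start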
> When using least squares, a zero derivative always marks a minimum. But that's not true in general ... To tell the difference between a minimum and a maximum, you'd need to look at the second derivative.<p>It's interesting to continue the analysis into higher dimensions, where there are stationary points whose classification requires looking at the matrix properties of a specific kind of second-order derivative, the Hessian: <a href="https://en.wikipedia.org/wiki/Saddle_point" rel="nofollow">https://en.wikipedia.org/wiki/Saddle_point</a><p>In general, it's super powerful to convert data problems like linear regression into geometric considerations.
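A tiny illustration of that test (my own example, not from the article): f(x, y) = x^2 - y^2 has zero gradient at the origin, but its Hessian has eigenvalues of both signs, so the origin is a saddle rather than a minimum. For least squares the Hessian is 2 X^T X, which is positive semidefinite, so a zero gradient really does mark a minimum.

    import numpy as np

    # Hessian of f(x, y) = x**2 - y**2 at the origin (where the gradient is zero).
    hessian = np.array([[2.0, 0.0],
                        [0.0, -2.0]])
    print(np.linalg.eigvalsh(hessian))   # [-2.  2.]: mixed signs -> saddle point

    # Compare with a least-squares Hessian, 2 * X.T @ X, for some design matrix X.
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    print(np.linalg.eigvalsh(2 * X.T @ X))   # all eigenvalues >= 0 -> stationary point is a minimum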
I don't have anything useful to say, but how the hell is that a "12 min read"?<p>I always find that those counters greatly overestimate reading speed, but for a technical article like this it's outright insulting, to be honest.
Nice, thanks for sharing! I shared this with my HS calculus teacher :) (My model is that his students should be motivated to get machine learning engineering jobs, so they should be motivated to learn calculus, but who knows.)
In the same vein, Karpathy's video series "Neural Networks: Zero to Hero" [0] touches on a lot of this and the underlying intuitions as well. It's one of the best introductory series (even if you ignore the neural-net part of it) and covers gradients, differentiation, and what they mean intuitively.<p>[0] <a href="https://youtu.be/VMj-3S1tku0?si=jq1cCSn5si17KK1o" rel="nofollow">https://youtu.be/VMj-3S1tku0?si=jq1cCSn5si17KK1o</a>
See another interactive article explaining linear regression and gradient descent: <a href="https://mlu-explain.github.io/linear-regression/" rel="nofollow">https://mlu-explain.github.io/linear-regression/</a>
All that's wrong with the modern world:<p><a href="https://www.ibm.com/think/topics/linear-regression" rel="nofollow">https://www.ibm.com/think/topics/linear-regression</a><p>A proven way to scientifically and reliably predict the future<p>Business and organizational leaders can make better decisions by using linear regression techniques. Organizations collect masses of data, and linear regression helps them use that data to better manage reality, instead of relying on experience and intuition. You can take large amounts of raw data and transform it into actionable information.<p>You can also use linear regression to provide better insights by uncovering patterns and relationships that your business colleagues might have previously seen and thought they already understood.<p>For example, performing an analysis of sales and purchase data can help you uncover specific purchasing patterns on particular days or at certain times. Insights gathered from regression analysis can help business leaders anticipate times when their company’s products will be in high demand.
The number of em dashes in this makes it look very AI-written. That doesn't make it a bad piece, but it does make me check every sentence more carefully for errors.