Another way to approach the explanation is to understand the data-generating process, i.e. the statistical assumptions about the process that generates the data. That goes a long way toward understanding _analytically_ whether a linear regression model is a good fit (or what to change in it to make it work). And, arguably more importantly, it is also a reason why we frame linear regression as a statistical problem rather than a pure optimization one (or an analytical OLS) in the first place. I would argue that the statistical standpoint gives a practitioner much better intuition.<p>The reason to look at the statistical assumptions is that we want to make probabilistic/statistical statements about the response variable: what its central tendency is and how much it varies as the values of X change. The response variable is not easy to measure.<p>Now, one can easily determine, for example using OLS (or gradient descent), the point estimates for the parameters of a line fit to two variables X and Y, without using any probability or statistical theory. OLS is, in point of fact, just an analytical result and has nothing to do with the theory of statistics or inference. The assumptions of simple linear regression are statistical assumptions which can be right or wrong, but if they hold, they help us make inferences like:<p><pre><code> - Does the response variable vary uniformly over values of another r.v., X (the predictors)?
 - Assuming an r.v. Y, what model can we build if its expectation is a linear function of X?
</code></pre>
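To make the "OLS is just an analytical result" point concrete, here is a minimal sketch (made-up data, numpy only): the point estimates fall out of closed-form arithmetic on the samples, with no probability model anywhere in sight.

```python
import numpy as np

# Hypothetical paired samples; any two columns of numbers would do.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)

# Closed-form OLS: beta_1 = cov(x, y) / var(x), beta_0 = mean(y) - beta_1 * mean(x).
# Purely analytical; no statistical assumption is used to compute these numbers.
beta_1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
beta_0 = y.mean() - beta_1 * x.mean()
print(beta_0, beta_1)  # close to the true 2.0 and 3.0 here
```

The statistical assumptions only enter when you ask what these two numbers *mean*, e.g. how uncertain they are, or whether they estimate anything sensible at all.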
So why do we make statistical assumptions instead of just point estimates?
Because no measurement is certain, and making those assumptions is one way of quantifying that uncertainty. Indeed, going through the history one finds that regression's use outside experimental data (Galton, 1885) was discovered long after least squares (Legendre 1805; Gauss, who claimed to have used it since 1795 and published in 1809). The fundamental desire to <i>understand</i> natural variation in data was the original motivation. In Galton's case, he wanted to study hereditary traits such as wealth over generations, as well as height, status, and intelligence (coincidentally, this is also what makes the assumptions of linear regression a good tool for studying them: the idea of regression to the mean. Very wealthy or very poor families don't remain so over generations; they regress toward the mean, and the same goes for societal class and intelligence across generations).<p>When you follow this arc of reasoning, you arrive at the following _statistical_ conditions the data must satisfy for the linear assumptions to work(ish):<p>Linear mean function of the response variable conditioned on a value of X:<p>E[Y|X=x] = \beta_0 + \beta_1*x<p>Constant variance of the response variable conditioned on a value of X:<p>Var[Y|X=x] = \sigma^2 (or, actually, merely finite variance also works well)
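The two conditions can be checked directly on a simulated data-generating process. A minimal sketch (numpy, with made-up parameter values): fix a few values of x, draw many realizations of Y at each, and verify that the conditional mean is linear in x while the conditional variance stays put.

```python
import numpy as np

# Assumed data-generating process: Y | X = x is beta_0 + beta_1*x plus
# noise with constant variance sigma^2 (parameter values are arbitrary).
rng = np.random.default_rng(1)
beta_0, beta_1, sigma = 1.0, 0.5, 2.0

for x in [0.0, 5.0, 10.0]:
    y = beta_0 + beta_1 * x + rng.normal(0, sigma, size=100_000)
    # E[Y | X = x] tracks the linear mean function...
    assert abs(y.mean() - (beta_0 + beta_1 * x)) < 0.05
    # ...while Var[Y | X = x] does not depend on x (homoscedasticity).
    assert abs(y.var() - sigma**2) < 0.1
```

If the noise scale were a function of x instead, the second assertion would fail: that is exactly the heteroscedastic case where plain linear regression starts to mislead.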