I'm just wrapping up a full semester course on multiple regression, and reading this is a great and different perspective on it.<p>I definitely appreciate the simple approach in the article. If the OP is like myself, perhaps he's posting this to better his understanding and leaving artifacts for others to follow as they learn. I have to point out, though, there's so much more happening in regression. To do it well, read further on it.<p>As a concrete example of why: the author mentions the R^2 value but doesn't seem to warn that adding more variables to your model will artificially increase it. For this reason, a better value is the "Adjusted R^2", which penalizes the model for each extra variable. There's also testing the validity of your model, building it up from scratch, understanding that you can't predict outside the domain of your independent variables, etc.<p>With that out of the way, I very much enjoyed seeing some of the math behind this. My class was entirely focused on just learning to use a statistical package to run regression. That's perfectly adequate, fine, and all I'll use on a day-to-day basis. But understanding what's going on beneath the covers has always enabled me to be more effective at the task.<p>Thanks!
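To make the adjusted R^2 point concrete, here's a small sketch (toy data, variable names made up for illustration) using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the sample count and p the number of predictors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: only the first predictor actually matters
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)

r2 = LinearRegression().fit(X, y).score(X, y)

# Adjusted R^2 penalizes each extra predictor, so it is
# strictly smaller than R^2 whenever p > 0 and R^2 < 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Fitting the same data with extra junk predictors will nudge R^2 up but leave adjusted R^2 roughly flat, which is exactly why it's the better number to report.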
Not to bag on the write-up, as it was well done, but does linear regression really qualify as machine learning? Almost every stats 101 course covers the topic, and probably 99% of people who use linear regressions in their day-to-day work would not call it machine learning. I know that linear regression is sometimes presented in machine learning courses, but I always thought it was done as a refresher, not as actual course material of any significant weight.
Totally bad article. It encourages bad practices like checking validity on the same set the model was trained on.
You should do some cross-validation, or at least split the data into two parts: train the model on the first part and test it on the second.
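Either approach is a few lines with scikit-learn (which the article uses). A minimal sketch, with toy data standing in for the article's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Toy data: y is a noisy linear function of two features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 30% of the data; score only on the unseen part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# Or do 5-fold cross-validation over the whole set
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
```

Scoring on held-out data is what catches overfitting; the in-sample R^2 the article reports can only flatter the model.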
Nice write-up! I'd caution against just checking for an exactly-zero determinant. Read up on ill-conditioned matrices, and instead check the condition number (or test whether the determinant falls below a small threshold) first. Also, work hard to never, ever have to actually fully invert a matrix.
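To sketch both points with NumPy (my own toy example, not from the article): check the condition number of the design matrix before trusting the solve, and use an SVD-based least-squares routine instead of forming an explicit inverse.

```python
import numpy as np

# Toy design matrix and response with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.01, size=50)

# Condition number: large values mean the normal equations are
# numerically unstable even if the determinant isn't exactly zero
cond = np.linalg.cond(X)

# lstsq solves min ||X b - y|| via SVD -- no explicit inverse of
# X^T X is ever formed, unlike the inv(X.T @ X) textbook formula
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

A determinant can be tiny simply because the matrix entries are small; the condition number is scale-free, which is why it's the right diagnostic here.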
Not bad, but I'd rather have seen statsmodels[1] (which is more intuitive to use, and gives you more data, as well as methods for displaying it) than sklearn used for the library. I understand the choice given that it's "machine learning", but as the comments are demonstrating, the distinction's not actually that clear.<p>[1] <a href="http://statsmodels.sourceforge.net/stable/gettingstarted.html" rel="nofollow">http://statsmodels.sourceforge.net/stable/gettingstarted.htm...</a>