ML From Scratch, Part 1: Linear Regression

301 points, by olooney, over 5 years ago

11 comments

stared, over 5 years ago

I tried to do from scratch AND hands-on in PyTorch: https://colab.research.google.com/github/stared/thinking-in-tensors-writing-in-pytorch/blob/master/3%20Linear%20regression.ipynb

By all means, much more "from scratch".

It is part of the open-source "Thinking in Tensors, Writing in PyTorch": https://github.com/stared/thinking-in-tensors-writing-in-pytorch
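As a taste of the hands-on approach such a notebook takes, here is a minimal gradient-descent linear regression in PyTorch (my own sketch in the same spirit, not code from the linked notebook; the data is synthetic):

    import torch

    # Synthetic 1-D data: y = 2x + 1 + noise
    torch.manual_seed(0)
    x = torch.linspace(0, 1, 100).unsqueeze(1)
    y = 2 * x + 1 + 0.05 * torch.randn_like(x)

    # Parameters of the line, learned by gradient descent
    w = torch.zeros(1, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w, b], lr=0.5)

    for _ in range(500):
        opt.zero_grad()
        loss = ((x * w + b - y) ** 2).mean()  # mean squared error
        loss.backward()
        opt.step()

    print(w.item(), b.item())  # roughly 2 and 1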
Sedai05, over 5 years ago

It says it's from scratch, but it's using a lot of words and math symbols I don't understand when describing Linear Regression.
z3phyr, over 5 years ago

At first I thought it was about implementing an ML-like language. Then the realisation hit pretty fast.
Sinidir, over 5 years ago

Understood everything up until this:

    ∇_Θ J = ∇_Θ (y − XΘ)^T (y − XΘ)

expanded into

    ∇_Θ J = ∇_Θ y^T y − (XΘ)^T y − y^T XΘ + Θ^T (X^T X) Θ

Can someone explain why ∇_Θ is only applied to the first term? Also, I have never seen gradient notation used like that before.
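A gloss on what is presumably intended here (my reading, not the article's own text): the lone ∇_Θ in front of y^T y looks like a typesetting slip, and the gradient should apply to the whole right-hand side. Differentiating term by term, in LaTeX:

    \nabla_\Theta J
      = \nabla_\Theta \bigl[ y^T y - (X\Theta)^T y - y^T X\Theta + \Theta^T (X^T X) \Theta \bigr]
      = 0 - X^T y - X^T y + 2 (X^T X) \Theta
      = 2 X^T X \Theta - 2 X^T y

Setting this to zero recovers the normal equations X^T X Θ = X^T y.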
MidgetGourde, over 5 years ago

I could balance equations in chemistry and do some algebra and trigonometry at GCSE level, but this blows my mind. Higher-tier GCSE Maths: I should have paid more attention. I think I know what regression is in simplistic terms: it's like taking a rolling graph and predicting the next point on the x/y axis, using the previous data as a training model. I hope.
j7ake, over 5 years ago

I found the notation here non-standard and confusing.

In my opinion, the Wikipedia article is more concise and provides more intuition (especially the multiple ways to derive the closed-form solution): https://en.wikipedia.org/wiki/Ordinary_least_squares#Alternative_derivations
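If it helps, the closed-form solution that article derives can be sketched in a few lines of NumPy (my own illustration; the data and variable names are made up):

    import numpy as np

    # Synthetic data: y = 2*x0 - 3*x1 + 1 + noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 2 * X[:, 0] - 3 * X[:, 1] + 1 + 0.1 * rng.normal(size=100)

    # Add an intercept column of ones
    A = np.column_stack([np.ones(len(X)), X])

    # Closed-form OLS: solve the normal equations (A^T A) b = A^T y.
    # np.linalg.lstsq is preferred over an explicit inverse for stability.
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(b)  # approximately [1, 2, -3]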
adipginting, over 5 years ago

For a computer science PhD or master's candidate, I think the article, and the site overall, is very good.
dlphn___xyz, over 5 years ago

One of the best 'from scratch' articles I've read on LR.
graycat, over 5 years ago

=== Introduction

Here I give some notes for a shorter, simpler, more general, no-calculus (no derivatives) derivation of the basics of regression analysis.

=== The Given Data

Let R be the set of real numbers (can use the set of rational numbers instead if you wish).

We are given positive integers m and n.

In our typing, character '^' starts a superscript; character '_' starts a subscript.

We regard R^m as the set of all m x 1 matrices (column vectors); that is, each point in R^m is a matrix with m rows and 1 column, with its components in set R.

We are given m x 1 y in R^m.

We are given m x n matrix A with components in R.

We will need to know what matrix *transpose* is: for an intuitive view, the *transpose* of A is A^T, where each column of A becomes a row of A^T. Or, to be explicit, for i = 1 to m and j = 1 to n, the component in row i and column j of A becomes the component in row j and column i of A^T.

We notice that each column of A is m x 1 and, thus, a point in R^m.

Suppose b is n x 1 with components in R.

Then the matrix product Ab is m x 1 in R^m and is a *linear combination* (with *coefficients* from b) of the columns of A.

=== Real World

So we have data A and y. To get this data, maybe we got some medical record data on m = 100,000 people.

For each person we got data on age, income, years of schooling, height, weight, birth gender (0 for male, 1 for female), years of smoking, blood sugar level, blood cholesterol level, and race (1 to 5 for white, black, Hispanic, Asian, other); that is, n = 10 variables. We get to select these variables and call them *independent*.

For person i = 1 to m = 100,000, we put the value of variable j = 1 to n = 10 in row i and column j of matrix A.

Or: matrix A has one row for each person and one column for each variable.

Vector y has systolic blood pressure: for i = 1 to m = 100,000, row i of y, that is, y_i, has the systolic blood pressure of person i. We regard systolic blood pressure in y as the *dependent* variable.

What we want to do is use our data to get a mathematical function that, for any person, in the m = 100,000 or not, takes in the values of the independent variables, 1 x n v, and returns the corresponding value of the dependent variable, 1 x 1 z. To this end we want n x 1 b so that we have

    z = vb

So for any person, we take their independent variable values v, apply coefficients b, and get their dependent variable value z.

We seek the b we will use on all people.

=== The *Normal Equations*

So, we have our given m x n A and m x 1 y, and we seek our n x 1 b.

We let

    S = { Au | n x 1 u }

That is, S is the set of all the *linear combinations* of the columns of A: a *hyperplane* in R^m, a generalization of a plane in three dimensions or a line in two dimensions, a vector *subspace* of R^m, the vector subspace of R^m *spanned* by the columns of A.

We seek w in S to minimize the squared *length*

    ||y - w||^2 = \sum_{i = 1}^m (y_i - w_i)^2

(notation from D. Knuth's math typesetting software TeX), that is, the sum (capital Greek letter sigma) from i = 1 to m of

    (y_i - w_i)^2

where y_i is component i of m x 1 y in R^m, and similarly for w_i.

That is, we seek the w in the hyperplane S that is closest to our dependent variable value y.

Well, from some simple geometry, the vector

    y - w

has to be perpendicular to the hyperplane S.

Also from the geometry, w is unique: the only point in S that minimizes the squared length

    ||y - w||^2

For the *geometry*, there is one classic theorem -- in W. Rudin, *Real and Complex Analysis* -- that will mostly do: in quite general situations, more general than R^m, every non-empty, closed, convex set has a unique element of minimum norm (length).

Now that we have the definitions and tools we need, we derive the *normal equations* in just two lines of matrix algebra. Writing w = Ab for some n x 1 b, since (y - w) is perpendicular to each column of A, we have that

    A^T (y - w) = 0

where the right side is an n x 1 matrix of 0s. Then

    A^T y = A^T Ab

or, with the usual writing order,

    (A^T A)b = A^T y

the *normal equations*, where we have A and y and solve for b.

=== Results

Since w is in S, the equations do have a solution. Moreover, from the geometry, w is unique.

If the n x n square matrix (A^T A) has an inverse, then the solution b is also unique.

Note: vector w DOES exist and is unique; b DOES exist; if the inverse of (A^T A) exists, then b is unique; otherwise b STILL exists but is not unique. STILL, w is unique. How 'bout that!

Since w is in S and since (y - w) is perpendicular to all the vectors in S, we have that

    (y - w)^T w = 0

so

    y^T y = (y - w + w)^T (y - w + w)
          = (y - w)^T (y - w) + (y - w)^T w + w^T (y - w) + w^T w
          = (y - w)^T (y - w) + w^T w

or

    [total sum of squares] = [regression sum of squares] + [error sum of squares]

or the Pythagorean theorem.

So, now that we have found b, for any v we can get our desired z from vb.

The b we found works *best* in the least squares sense for the m = 100,000 data we had. If the 100,000 are an appropriate *sample* of a billion, and if our b works well on our sample, then maybe b will work well on the billion. Ah, maybe we could prove some theorems here, theorems with meager or no more assumptions than we have made so far!

"Look, Ma, no calculus! And no mention of the Gaussian distribution. No mention of maximum likelihood.

Except for regarding the m = 100,000 *observations* as a *sample* from a billion, we have mentioned no probability concepts. We never asked for a matrix inverse. And we never mentioned *multicollinearity*."

And maybe the inverse of n x n (A^T A) does exist but is "numerically unstable". Sooooo, we begin to suspect: the b we get may be inaccurate, but the w may still be fine, and our

    z = vb

may also still be accurate. Might want to look into this.

For a start, sure, the matrix (A^T A) being numerically unstable is essentially the same as the ratio of the largest and smallest (positive, they are all positive) eigenvalues being large. That is, roughly, the problem is that the small eigenvalues are, in the *scale* of the problem, close to 0. Even if they are 0, we have shown that our w is still unique.
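A quick numerical check of the two facts at the heart of this derivation, the orthogonality A^T (y - w) = 0 and the Pythagorean decomposition (my own sketch on synthetic data, not part of the comment above):

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 100_000, 10
    A = rng.normal(size=(m, n))                      # one row per person, one column per variable
    y = A @ rng.normal(size=n) + rng.normal(size=m)  # synthetic dependent variable

    # Solve the normal equations (A^T A) b = A^T y directly.
    # (If A^T A were near-singular, np.linalg.lstsq would be the stabler route.)
    b = np.linalg.solve(A.T @ A, A.T @ y)
    w = A @ b  # the projection of y onto S, the column space of A

    # The residual y - w is perpendicular to every column of A:
    print(np.allclose(A.T @ (y - w), 0, atol=1e-6))      # True

    # And the decomposition y^T y = (y-w)^T (y-w) + w^T w holds:
    print(np.isclose(y @ y, (y - w) @ (y - w) + w @ w))  # True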
vermooten, over 5 years ago

Hardly 'from scratch'; I was expecting an ELI5 article.
smitty1e, over 5 years ago
The LaTeX was LaToast, which is LaShame.