=== Introduction<p>Here I give some notes for a shorter,
simpler, more general derivation of the
basics of regression analysis, one with no
calculus and no derivatives.<p>=== The Given Data<p>Let R be the set of real numbers (we could
use the set of rational numbers instead if
we wished).<p>We are given positive integers m and n.<p>In our typing, character '^' starts a
superscript; character '_' starts a
subscript.<p>We regard R^m as the set of all m x 1
matrices (column vectors); that is, each
point in R^m is a matrix with m rows and 1
column with its components in set R.<p>We are given m x 1 y in R^m.<p>We are given m x n matrix A with
components in R.<p>We will need to know what matrix
<i>transpose</i> is: For an intuitive view,
the <i>transpose</i> of A is A^T where each
column of A becomes a row of A^T. Or to
be explicit, for i = 1 to m and j = 1 to
n, the component in row i and column j of
A becomes the component in row j and
column i of A^T.<p>We notice that each column of A is m x 1
and, thus, a point in R^m.<p>Suppose b is n x 1 with components in R.<p>Then the matrix product Ab is m x 1 in R^m
and is a <i>linear combination</i> (with
<i>coefficients</i> from b) of the columns of
A.
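<p>As a quick sanity check on that <i>linear
combination</i> view, here is a small sketch
in Python with NumPy (not part of the
derivation; the sizes are made up):

  import numpy as np

  m, n = 5, 3
  A = np.random.rand(m, n)    # an m x n matrix
  b = np.random.rand(n, 1)    # an n x 1 column vector

  # Ab as a matrix product
  w1 = A @ b

  # Ab as a linear combination of the columns
  # of A, with coefficients from b
  w2 = sum(b[j, 0] * A[:, [j]] for j in range(n))

  print(np.allclose(w1, w2))  # True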
<p>=== Real World<p>So we have data A and y. To get this
data, maybe we got some medical record
data on m = 100,000 people.<p>For each person we got data on age,
income, years of schooling, height,
weight, birth gender (0 for male, 1 for
female), years of smoking, blood sugar
level, blood cholesterol level, and race
(coded 1 to 5 for white, black, Hispanic,
Asian, other), that is, n = 10 variables.
We get
to select these variables and call them
<i>independent</i>.<p>For person i = 1 to m = 100,000, we put
the value of variable j = 1 to n = 10 in
row i and column j of matrix A.<p>Or matrix A has one row for each person
and one column for each variable.<p>Vector y has systolic blood pressure: For
i = 1 to m = 100,000, row i of y, that is,
y_i, has the systolic blood pressure of
person i. We regard systolic blood
pressure in y as the <i>dependent</i> variable.
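<p>To make the layout concrete, a tiny sketch
in Python with NumPy (not part of the
derivation; the people, variables, and
numbers are made-up stand-ins):

  import numpy as np

  # Rows are people, columns are variables
  # (say age, weight, years of smoking).
  A = np.array([[34.0, 70.0,  0.0],
                [51.0, 82.0, 20.0],
                [45.0, 95.0,  5.0],
                [62.0, 77.0, 30.0]])    # m = 4, n = 3

  # y_i is the systolic blood pressure of person i.
  y = np.array([[118.0],
                [135.0],
                [141.0],
                [150.0]])               # m x 1

  # A[i, j] is the value of variable j for person i.
  print(A.shape, y.shape)               # (4, 3) (4, 1)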
<p>What we want to do is use our data to get
a mathematical function that for any
person, in the m = 100,000 or not, takes
in the values of the independent
variables, 1 x n v, and returns the
corresponding value of the dependent
variable 1 x 1 z. To this end we want n x
1 b so that we have<p>z = vb<p>So for any person, we take their
independent variable values v, apply
coefficients b, and get their dependent
variable value z.<p>We seek the b we will use on all people.
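<p>In code, that prediction step is one small
matrix product (again a sketch, not part
of the derivation; the numbers are made up
and this b is only a placeholder until we
derive the real one below):

  import numpy as np

  n = 3
  v = np.array([[45.0, 82.0, 10.0]])   # 1 x n, one person's values
  b = np.array([[1.2], [0.4], [0.9]])  # n x 1, placeholder coefficients

  z = v @ b                            # 1 x 1
  print(float(z[0, 0]))
  # The same thing written out as a sum:
  print(sum(v[0, j] * b[j, 0] for j in range(n)))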
<p>=== The <i>Normal Equations</i><p>So, we have our given m x n A and m x 1 y,
and we seek our n x 1 b.<p>We let<p>S = { Au | n x 1 u }<p>That is, S is the set of all the <i>linear
combinations</i> of the columns of A: a
vector <i>subspace</i> of R^m, a generalization
of a plane through the origin in three
dimensions or a line through the origin in
two dimensions, the subspace of R^m
<i>spanned</i> by the columns of A.<p>We seek w in S to minimize the squared
<i>length</i><p>||y - w||^2<p>= \sum_{i = 1}^m (y_i - w_i)^2<p>(notation from D. Knuth's math typesetting
software TeX), that is the sum (capital
letter Greek sigma) from i = 1 to m of<p>(y_i - w_i)^2<p>where y_i is component i of m x 1 y in R^m
and similarly for w_i.<p>That is, we seek the w in the subspace S
that is closest to our dependent variable
value y.<p>Well, from some simple geometry, the
vector<p>y - w<p>has to be perpendicular to the subspace S.
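<p>To spell that simple geometry out (an
aside, still with no calculus): let s be
any vector in S and t any number in R.
Then w + ts is also in S, and<p>||y - (w + ts)||^2 = ||y - w||^2 - 2t s^T (y - w) + t^2 ||s||^2<p>If s^T (y - w) were not 0, then taking t
small and of the same sign as s^T (y - w)
would make the right side smaller than
||y - w||^2, contradicting that w is a
closest point of S to y. So s^T (y - w) =
0 for every s in S, in particular for each
column of A.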
<p>Also from the geometry, w is unique, the
only point in S that minimizes the squared
length<p>||y - w||^2<p>For the <i>geometry</i>, there is one classic
theorem -- in W. Rudin, <i>Real and Complex
Analysis</i> -- that will mostly do: In
quite general situations, more general
than R^m, every non-empty, closed, convex
set has a unique element of minimum norm
(length). Here we apply that theorem to
the non-empty, closed, convex set<p>{ y - s | s in S }<p>whose unique element of minimum norm is
y - w; that gives us both the existence
and the uniqueness of our w.<p>Now that we have the definitions and tools
we need, we derive the <i>normal equations</i>
in just two lines of matrix algebra:<p>Since (y - w) is perpendicular to each
column of A, we have that<p>A^T (y - w) = 0<p>where the right side is an n x 1 matrix of
0s.<p>Since w is in S, we can write w = Ab for
some n x 1 b. Then<p>A^T y = A^T Ab<p>or, with the usual writing order,<p>(A^T A)b = A^T y<p>the <i>normal equations</i>, where we have A and
y and solve for b.
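<p>To see the arithmetic run, here is a
sketch in Python with NumPy (not part of
the derivation; A and y are random
stand-ins, and np.linalg.solve assumes
(A^T A) has an inverse, the case discussed
below):

  import numpy as np

  m, n = 100, 4
  A = np.random.rand(m, n)               # given m x n data
  y = np.random.rand(m, 1)               # given m x 1 data

  # Solve the normal equations (A^T A) b = A^T y.
  b = np.linalg.solve(A.T @ A, A.T @ y)  # n x 1

  w = A @ b                              # the closest point of S to y

  # A^T (y - w) should be the zero vector (up to rounding).
  print(np.allclose(A.T @ (y - w), 0.0))

  # For a new person with independent variable values v (1 x n),
  # the predicted dependent variable is z = vb (1 x 1).
  v = np.random.rand(1, n)
  z = v @ b
  print(z.shape)                         # (1, 1)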
<p>=== Results<p>Since w is in S, the equations do have a
solution. Moreover, from the geometry, w
is unique.<p>If the n x n square matrix (A^T A) has an
inverse, then the solution b is also
unique.<p>Note: Vector w DOES exist and is unique;
b DOES exist; if the inverse of (A^T A)
exists, then b is unique; otherwise b
STILL exists but is not unique. STILL w
is unique. How 'bout that!
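<p>A small illustration of that last point in
Python with NumPy (a sketch with a made-up
A and y, where a repeated column makes
(A^T A) have no inverse):

  import numpy as np

  # The third column repeats the second, so the columns
  # are dependent and (A^T A) has no inverse.
  A = np.array([[1.0, 2.0, 2.0],
                [1.0, 0.0, 0.0],
                [1.0, 1.0, 1.0],
                [1.0, 3.0, 3.0]])
  y = np.array([[1.0], [2.0], [0.0], [5.0]])

  # One least squares solution, via lstsq, which does not
  # need (A^T A) to be invertible.
  b1 = np.linalg.lstsq(A, y, rcond=None)[0]

  # Another solution: shift b1 along the null space of A.
  b2 = b1 + np.array([[0.0], [1.0], [-1.0]])

  print(np.allclose(A @ b1, A @ b2))  # True: the same w
  print(np.allclose(b1, b2))          # False: different b's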
<p>Since w is in S and since (y - w) is
perpendicular to all the vectors in S, we
have that<p>(y - w)^T w = 0<p>y^T y = (y - w + w)^T (y - w + w)<p>= (y - w)^T (y - w) + (y - w)^T w + w^T (y - w) + w^T w<p>= (y - w)^T (y - w) + w^T w<p>or<p>[total sum of squares] = [regression sum of squares] + [error sum of squares]<p>or the Pythagorean theorem.
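<p>A quick numerical check of that
decomposition (a sketch with random
stand-in data):

  import numpy as np

  m, n = 50, 3
  A = np.random.rand(m, n)
  y = np.random.rand(m, 1)

  b = np.linalg.lstsq(A, y, rcond=None)[0]
  w = A @ b

  total = (y.T @ y).item()
  error = ((y - w).T @ (y - w)).item()
  regression = (w.T @ w).item()

  # y^T y = (y - w)^T (y - w) + w^T w
  print(np.isclose(total, error + regression))  # True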
<p>So, now that we have found b, for any v,
we can get our desired z from vb.<p>The b we found works <i>best</i> in the least
squares sense for the m = 100,000 data we
had. If the 100,000 are an appropriate
<i>sample</i> of a billion, and if our b works
well on our sample, then maybe b will work
well on the billion. Ah, maybe we could
prove some theorems here, theorems with
meager or no more assumptions than we have
made so far!<p>"Look, Ma, no calculus! And no mention of
the Gaussian distribution. No mention of
maximum likelihood.<p>Except for regarding the m = 100,000
<i>observations</i> as a <i>sample</i> from a
billion, we have mentioned no probability
concepts. We never asked for a matrix
inverse. And we never mentioned
<i>multicollinearity</i>."<p>And, maybe the inverse of n x n (A^T A)
does exist but is "numerically unstable".
Sooooo, we begin to suspect: The b we get
may be inaccurate but the w may still be
fine and our<p>z = vb<p>may also still be accurate. Might want to
look into this.<p>For a start, sure, the matrix (A^T A)
being numerically unstable is essentially
the same as the ratio of the largest and
smallest eigenvalues (they are all
non-negative) being large. That is,
roughly, the problem is that the small
eigenvalues are, in the <i>scale</i> of the
problem, close to 0. Even if some of them
are exactly 0, we have shown that our w is
still unique.
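<p>As a closing sketch in Python with NumPy
(made-up, nearly collinear data), here is
that situation in miniature: the
eigenvalue ratio of (A^T A) is huge, and a
b moved a long way along the direction of
the smallest eigenvalue gives nearly the
same w, hence nearly the same fit:

  import numpy as np

  rng = np.random.default_rng(0)
  m = 200
  x = rng.normal(size=(m, 1))

  # Two nearly collinear columns, so the eigenvalues of
  # (A^T A) are badly out of scale with each other.
  A = np.hstack([x, x + 1e-6 * rng.normal(size=(m, 1))])
  y = x + 0.1 * rng.normal(size=(m, 1))

  b = np.linalg.lstsq(A, y, rcond=None)[0]
  w = A @ b

  eigvals, eigvecs = np.linalg.eigh(A.T @ A)
  print(eigvals.max() / eigvals.min())  # huge ratio

  # Move b a whole unit along the direction of the
  # smallest eigenvalue: a very different b ...
  b_other = b + eigvecs[:, [0]]
  w_other = A @ b_other

  # ... but nearly the same w, hence nearly the same fit.
  print(np.max(np.abs(b_other - b)))    # roughly 0.7
  print(np.max(np.abs(w_other - w)))    # tiny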