
Reasons to not use PCA for feature selection

99 points by leonry about 3 years ago

11 comments

jstx1 about 3 years ago
The only reason you need: PCA is not a feature selection algorithm.

Is the author misunderstanding something very basic or are they deliberately writing this way for clicks and attention? I can see that they have great credentials so probably the latter? It's a weird article.
civilized about 3 years ago
Now that we've all said "feature selection is not dimensionality reduction" to our hearts' content, could we return to the point of the article?

Regardless of whether you're doing feature selection or dimensionality reduction, the point remains that, if you're doing supervised learning, PCA is just compressing your X space, without any regard to your y. It could be that the last principal component of X, containing only 0.1% of the variance, contains 100% of the correlation between X and y.

Using PCA for dimensionality reduction in a supervised learning context means throwing out an unknown amount of signal, which could be up to 100% of the signal.

Now for unsupervised, exploratory analysis, PCA is definitely a candidate, but there are plenty of often-better alternatives there too.
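A quick sketch of that failure mode, for concreteness (purely synthetic data; the scales are chosen so that almost all of the variance is noise and the signal sits in the low-variance direction, not an example from the article):

```python
# The low-variance feature carries all of the signal about y, so keeping
# only the top principal component throws the signal away.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5_000
x_big = rng.normal(scale=10.0, size=n)    # high variance, unrelated to y
x_small = rng.normal(scale=0.1, size=n)   # low variance, drives y entirely
X = np.column_stack([x_big, x_small])
y = 3.0 * x_small + rng.normal(scale=0.01, size=n)

X_top_pc = PCA(n_components=1).fit_transform(X)  # keeps the high-variance direction

r2_full = LinearRegression().fit(X, y).score(X, y)
r2_pca = LinearRegression().fit(X_top_pc, y).score(X_top_pc, y)
print(f"R^2 with both features: {r2_full:.3f}")  # ~1.0
print(f"R^2 with top-1 PC:      {r2_pca:.3f}")   # ~0.0
```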
quanto about 3 years ago
PCA can be construed as a lossy compression of the data matrix. In fact, the Eckart-Young theorem shows that this is an optimal compression (optimal for a given low rank, i.e. the space needed to hold the values). In the language of OP, this shows the minimal energy loss for a given space constraint.

The key word is "lossy". It may well be that the part that was lost had the signal for further classification down the pipeline. Or maybe not. It depends on the case.
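As a rough numerical illustration of the Eckart-Young point (random data, rank chosen arbitrarily; a sketch, not anything from the post):

```python
# Eckart-Young: the rank-k truncated SVD of the (centered) data matrix is the
# best rank-k approximation in Frobenius norm, i.e. it minimizes the "energy"
# lost for a given space budget.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X -= X.mean(axis=0)                           # center, as PCA does

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

err = np.linalg.norm(X - X_k, "fro") ** 2     # equals the sum of the dropped s_i^2
retained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(f"squared Frobenius error of the rank-{k} approximation: {err:.2f}")
print(f"fraction of 'energy' (variance) retained: {retained:.2%}")
```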
srean about 3 years ago
There was a recent discussion on PCA for classification that I had walked into; however, everyone had left the building when I joined. Since I run into this conceptual misunderstanding of PCA's relevance to classification often, let me repeat what I said there.

The problem with using PCA as a preprocessing step for linear classification is that this dimensionality reduction step is being done without paying any heed to the end goal -- better linear separation of the classes. One can get lucky and get a low-d projection that separates well, but that is pure luck. Let me see if I can draw an example:

      ++++++ ++++++++++++++++ +++
    --------- ----------------- -------

The '+' and '-' denote the data points of the two different classes. In this example the PCA direction will be along the X axis, which would be the worst axis to project onto to separate the classes. The best in this case would have been the Y axis.

A far better approach would be to use a dimensionality reduction technique that is aware of the end goal. One such example is Fisher discriminant analysis and its kernelized variant.
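A rough rendering of that picture in code, using scikit-learn's LinearDiscriminantAnalysis as the Fisher discriminant; the class shapes and scales below are invented to mimic the ASCII sketch, not taken from the comment:

```python
# Two elongated classes separated vertically, as in the sketch above.
# PCA's top direction follows the long (x) axis; Fisher/LDA picks the
# separating (y) axis instead.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(scale=10.0, size=2 * n)             # large spread along x
y_plus = rng.normal(loc=+1.0, scale=0.2, size=n)   # '+' class sits above
y_minus = rng.normal(loc=-1.0, scale=0.2, size=n)  # '-' class sits below
X = np.column_stack([x, np.concatenate([y_plus, y_minus])])
labels = np.array([1] * n + [0] * n)

pca_dir = PCA(n_components=1).fit(X).components_[0]
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, labels)

print("PCA direction:", np.round(pca_dir, 2))              # ~[±1, 0]: the x axis
print("LDA direction:", np.round(lda.scalings_[:, 0], 2))  # dominated by the y axis
print("LDA training accuracy:", lda.score(X, labels))      # ~1.0
```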
platz about 3 years ago
> 1) Does linearly combining my features make any sense?

> 2) Can I think of an explanation for why the linearly combined features could have as simple a relationship to the target as the original features?

The article provides a negative example where PCA does not fit, but doesn't provide an example where it does, or what PCA is actually used for. I come away from this article thinking PCA is useless.

What would be an example where 2) is true?

I cannot answer 2) without already having experience of what explanations there could possibly be. (2) is almost begging the question, at least pedagogically: PCA is good when the features are good for PCA.)

When does linearly combining my features "make sense"? Again, an example is not provided.
raverbashing about 3 years ago
Here's a better idea: don't use generalized statements, but rather test your data first.

If your model fits well (and doesn't overfit) after PCA, then go for it. If not, revisit.

PCA has its place, and as the other commenter said, sure, it's not a feature selection algorithm. Or you can just feature-select manually.
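One way to run that check, sketched with an arbitrary public dataset and an arbitrary choice of 5 components (the pipeline details are assumptions, not the commenter's):

```python
# Compare cross-validated scores with and without a PCA step, instead of
# deciding in the abstract whether PCA helps.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
with_pca = make_pipeline(StandardScaler(), PCA(n_components=5),
                         LogisticRegression(max_iter=5000))

print("no PCA :", cross_val_score(plain, X, y, cv=5).mean())
print("PCA(5) :", cross_val_score(with_pca, X, y, cv=5).mean())
```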
anonymoushn about 3 years ago
Much of the text is outside the viewport and cannot be scrolled to on mobile.
first_post about 3 years ago
What would you recommend for feature selection in, say, single-cell RNA-seq studies? A typical dataset is ~10,000 x ~30,000 (cells x genes), with >90% of the table filled with 0s (which could be due to biological or technical noise).

PCA and UMAP are, yes, dimensionality reduction methods, but they are often seen as tools for feature selection.

See slide 61 here: https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/10_lecture.pdf
plandis about 3 years ago
I followed the math but I don’t know ML… Do practitioners really use “energy” and “conservation of energy”? That just seems overly confusing.
rybosworld about 3 years ago
PCA has a niche use-case. It's more often harmful than not in my experience.
ylks about 3 years ago
Hello,

I'm the author of the post. I'm slightly late to the party, but I'll try to clarify a few misunderstandings.

First and foremost, the post deals with the following scenario too many data scientists find themselves in: "I have (generated) a lot of features; let me do PCA and train my model using the top few principal components". This is a terrible idea and the post explains why.

Second, there seems to be a debate about 'feature selection' vs. 'feature construction' (or 'feature generation'), and whether PCA is of the former or latter type. Here are the definitions I use in the whole blog.

Feature Construction is the process consisting of generating candidate features (i.e. transformations of the original inputs) that might have a simpler relationship with the target, one that models in our toolbox can reliably learn.

E.g. a linear model cannot learn a quadratic function. However, because a quadratic function of x is linear in [x, x^2], the feature transformation x -> [x, x^2] is needed to make our quadratic function learnable by a linear model.

Check out this post for more details on what's at stake during feature construction: https://blog.kxy.ai/feature-construction/

Feature Selection is the process consisting of removing useless features in the set of candidates generated by feature construction. A feature is deemed useless when it is uninformative about the target or redundant.

Check out this post for a formulation of three key properties of a feature using Shapley values: feature importance, feature usefulness, and feature potential: https://blog.kxy.ai/feature-engineering-with-game-theory-beyond-shap/

In the scenario the blog post deals with (i.e. "I have (generated) a lot of features; let me do PCA and train my model using the top few principal components"), data scientists do both feature construction (full PCA, i.e. projecting the original input onto eigenvectors to obtain as many principal components as the dimension of the original input) AND feature selection (only selecting the first few principal components with the highest eigenvalues).

When the goal is to predict y from x, using PCA for either feature construction OR feature selection is a bad idea!

For feature construction, there is nothing in PCA that will intrinsically guarantee that a linear combination of coordinates of x will have a simpler relationship to y than x itself. PCA does not even use y in this case! E.g. imagine all coordinates of x but the first are pure noise (as far as predicting y is concerned). Any linear combination of x will just make your inputs noisy!

For feature selection, even assuming principal components make sense as features, principal components with the highest variances (i.e. corresponding to the highest eigenvalues) need not be the most useful for predicting y! High variance does not imply high signal.
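A minimal sketch of the x -> [x, x^2] point above (synthetic data; the quadratic coefficients are arbitrary):

```python
# A linear model cannot fit a quadratic target from x alone, but fits it
# exactly once x^2 is constructed as an extra feature.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=1_000)
y = 2.0 * x**2 - x + 1.0                 # quadratic in x, linear in [x, x^2]

X_raw = x.reshape(-1, 1)
X_constructed = np.column_stack([x, x**2])

print("R^2 on x alone: ",
      LinearRegression().fit(X_raw, y).score(X_raw, y))                  # poor fit
print("R^2 on [x, x^2]:",
      LinearRegression().fit(X_constructed, y).score(X_constructed, y))  # exactly 1.0
```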