Hello,

I'm the author of the post. I'm slightly late to the party, but I'll try to clarify a few misunderstandings.

First and foremost, the post deals with a scenario too many data scientists find themselves in: "I have (generated) a lot of features; let me do PCA and train my model on the top few principal components." This is a terrible idea, and the post explains why.

Second, there seems to be a debate about 'feature selection' vs. 'feature construction' (or 'feature generation'), and whether PCA is of the former or latter type. Here are the definitions I use throughout the blog.

Feature construction is the process of generating candidate features (i.e. transformations of the original inputs) that might have a simpler relationship with the target, one that models in our toolbox can reliably learn. E.g. a linear model cannot learn a quadratic function. However, because a quadratic function of x is linear in [x, x^2], the feature transformation x -> [x, x^2] makes that quadratic function learnable by a linear model.

Check out this post for more details on what's at stake during feature construction: https://blog.kxy.ai/feature-construction/

Feature selection is the process of removing useless features from the set of candidates generated by feature construction. A feature is deemed useless when it is uninformative about the target or redundant.

Check out this post for a formulation of three key properties of a feature using Shapley values (feature importance, feature usefulness, and feature potential): https://blog.kxy.ai/feature-engineering-with-game-theory-beyond-shap/

In the scenario the blog post deals with (i.e. "I have a lot of features; let me do PCA and train my model on the top few principal components"), data scientists do both feature construction (full PCA, i.e. projecting the original input onto eigenvectors to obtain as many principal components as the dimension of the original input) AND feature selection (keeping only the first few principal components, those with the highest eigenvalues).

When the goal is to predict y from x, using PCA for either feature construction OR feature selection is a bad idea!

For feature construction, nothing in PCA intrinsically guarantees that a linear combination of the coordinates of x will have a simpler relationship to y than x itself. PCA does not even use y in this case! E.g. imagine that all coordinates of x but the first are pure noise (as far as predicting y is concerned). Any linear combination of the coordinates of x will just mix that noise into your inputs.

For feature selection, even assuming principal components make sense as features, the principal components with the highest variances (i.e. those corresponding to the highest eigenvalues) need not be the most useful for predicting y! High variance does not imply high signal.
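P.S. A quick toy sketch of the feature-construction example above, in case it helps (my own minimal example, not code from the post; it assumes numpy and scikit-learn, and the variable names are mine): a linear model can't fit y = x^2 from x alone, but fits it exactly once x^2 is added as a feature.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=1000)
    y = x ** 2

    X_raw = x.reshape(-1, 1)                      # original input: just x
    X_constructed = np.column_stack([x, x ** 2])  # constructed features: [x, x^2]

    print(LinearRegression().fit(X_raw, y).score(X_raw, y))                  # R^2 ~ 0: not learnable
    print(LinearRegression().fit(X_constructed, y).score(X_constructed, y))  # R^2 = 1: now linear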
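And a similar toy sketch of the last point, that high variance does not imply high signal (again my own example, not from the post): here the only informative coordinate of x has low variance, the other nine coordinates are high-variance pure noise, so the top principal components are dominated by noise and keeping only them throws the signal away.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 10_000
    signal = rng.normal(scale=1.0, size=n)        # low-variance, informative coordinate
    noise = rng.normal(scale=10.0, size=(n, 9))   # high-variance, uninformative coordinates
    X = np.column_stack([signal, noise])
    y = 3.0 * signal + rng.normal(scale=0.1, size=n)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The recipe the post warns against: keep only the 3 highest-variance principal components.
    pca = PCA(n_components=3).fit(X_train)
    Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)

    r2_pca = r2_score(y_test, LinearRegression().fit(Z_train, y_train).predict(Z_test))
    r2_raw = r2_score(y_test, LinearRegression().fit(X_train, y_train).predict(X_test))
    print(r2_pca)  # close to 0: the informative direction was discarded
    print(r2_raw)  # close to 1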