I got into quant finance 12 years ago with the mistaken idea that I was going to successfully use all these cool machine learning techniques (genetic programming! SVMs! neural networks!) to run great statistical arbitrage books.<p>Most machine learning techniques focus on problems where the signal is very strong, but the structure is very complex. For instance, take the problem of recognizing whether a picture is a picture of a bird. A human will do well on this task, which shows that there is very little intrinsic noise. However, the correlation of any given pixel with the class of the image is essentially 0. The "noise" is in discovering the unknown relationship between pixels and class, not in the actual output.<p>In statistical arbitrage, noise dominates everything. An R^2 of 1% <i>is</i> something to write home about. With this amount of noise, it's generally hard to do much better than a linear regression. Any model complexity has to come from integrating over latent parameters or from manual feature engineering; the rest will overfit.<p>I think Geoffrey Hinton said that statistics and machine learning are really the same thing, but since we have two different names for it, we might as well call machine learning everything that deals with problems with a complex structure and low noise, and statistics everything that deals with problems with a large amount of noise. I like this distinction, and I did end up picking up a lot of statistics working in this field.<p>I'll regularly get emails from friends who tried some machine learning technique on some dataset and found promising results. As the article points out, these generally don't hold up. Accounting for every source of bias in a backtest is an art. The most common mistake is to assume that you can observe the relative price of two stocks at the close, and trade at that price.
Many pairs trading strategies appear to work if you make this assumption (which tends to be the case if all you have are daily bars), but they really do not. Other mistakes include: assuming transaction costs will be the same on average (they won't: your strategy likely detects opportunities at times when the spread is very wide and prices are bad), assuming index memberships don't change (they do, and that creates selection bias), assuming you can short anything (stocks can be hard to short or have high borrowing costs), etc.<p>In general, statistical arbitrage isn't bound by machine learning (1), and it is not a data mining endeavor. It works by understanding the latent market dynamics you are trying to capitalize on, finding new data feeds that provide valuable information, carefully building out a model to test your hypothesis, and deriving a sound trading strategy from that model.<p>(1: This isn't always true. For instance, analyzing news with NLP, or using computer vision to estimate crop outputs from satellite imagery, can make use of machine learning techniques to yield useful, tradeable signals. My comment mostly focuses on machine learning applied to price information.)
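<p>To make the noise point above concrete, here is a minimal sketch on synthetic data, not anything taken from a real book. It assumes a single made-up feature whose true R^2 is about 1%, and compares an ordinary least squares fit against a 1-nearest-neighbor regressor, used here only as a stand-in for an overly flexible model. All names (`simulate`, `fit_ols`, `fit_1nn`, `r_squared`) and all parameters are hypothetical.

```python
import bisect
import random


def simulate(n, beta, rng):
    """Draw n pairs with y = beta * x + unit-variance Gaussian noise."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [beta * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def fit_ols(xs, ys):
    """One-feature ordinary least squares; returns a predict function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_1nn(xs, ys):
    """1-nearest-neighbor regressor: predicts the y of the closest training x."""
    pairs = sorted(zip(xs, ys))
    keys = [p[0] for p in pairs]

    def predict(x):
        i = bisect.bisect_left(keys, x)
        # The nearest neighbor is either just below or just above x.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(keys)]
        j = min(candidates, key=lambda k: abs(keys[k] - x))
        return pairs[j][1]

    return predict


def r_squared(predict, xs, ys):
    """R^2 of predict on (xs, ys), relative to the mean of ys."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sse / sst


rng = random.Random(0)
beta = 0.1  # assumed signal: true R^2 = beta^2 / (beta^2 + 1), about 1%

train_x, train_y = simulate(5000, beta, rng)
test_x, test_y = simulate(5000, beta, rng)

linear = fit_ols(train_x, train_y)
nearest = fit_1nn(train_x, train_y)

r2_linear_out = r_squared(linear, test_x, test_y)
r2_nearest_in = r_squared(nearest, train_x, train_y)
r2_nearest_out = r_squared(nearest, test_x, test_y)

print("linear, out of sample:", r2_linear_out)
print("1-NN, in sample:      ", r2_nearest_in)
print("1-NN, out of sample:  ", r2_nearest_out)
```

In a setup like this, the nearest-neighbor model memorizes the training set, so it looks perfect in sample, but out of sample it mostly reproduces someone else's noise and should come out with a strongly negative R^2. The linear fit should stay close to the tiny true value. Real return data is messier than this (non-stationary, fat-tailed, and so on), so treat it as an illustration of the overfitting argument only.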