Another good dimensionality reduction technique to consider is Latent Dirichlet Allocation (LDA). I use this approach for natural language and other "bursty" data sets. "Bursty" data sets are characterized by a Zipfian distribution over features, where certain long-tail features become much more likely to be observed again once they have appeared in an instance. For example, "armadillo" is relatively rare overall, but an article that mentions an armadillo once has a high chance of mentioning it again.

A cool thing about LDA is that it lets you express the latent characteristics of a given document as a point in Euclidean space. That gives you the ability to use spatial distance metrics such as cosine distance to express document similarity; I specifically use this for recommending large-scale UGC communities based on their latent characteristics. Furthermore, since you've turned your language data into spatial data, you can use spatial classifiers such as SVMs more effectively over natural language data, which otherwise tends to be better suited to Bayesian classifiers.

I'm a huge fan of Gensim for its LDA implementation. It's even capable of distributed computing using Pyro4, and it's relatively trivial to deploy an LDA pipeline for extremely large datasets using EC2 and the Boto AWS library. A rough sketch of the pipeline is below.

Edit: If you haven't heard of it, scikit-learn is an awesome Python library for high-performance machine learning built on Python's C extensions for numerical computing (scipy, numpy). It's easy to take the topic vectors you get above and train a classifier on them with scikit-learn.
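
Here's a minimal sketch of what the Gensim side can look like. The toy corpus, the whitespace tokenization, and the choice of two topics are all made up for illustration, so treat it as a starting point rather than anything production-ready:

    from gensim import corpora, models, matutils

    # Tiny made-up corpus; in practice you'd stream documents from disk.
    docs = [
        "the armadillo crossed the road and the armadillo kept walking",
        "machine learning on text needs dimensionality reduction",
        "the road was long and the walk was longer",
    ]
    tokenized = [d.lower().split() for d in docs]

    # Map tokens to integer ids and build a bag-of-words corpus.
    dictionary = corpora.Dictionary(tokenized)
    bow_corpus = [dictionary.doc2bow(toks) for toks in tokenized]

    # Train LDA; num_topics is a guess you'd tune for your data.
    # For very large corpora, LdaModel also accepts distributed=True,
    # which farms work out to Pyro4 workers.
    lda = models.LdaModel(bow_corpus, id2word=dictionary,
                          num_topics=2, passes=10)

    # Each document becomes a dense vector in topic space.
    doc_vecs = [matutils.sparse2full(lda[bow], lda.num_topics)
                for bow in bow_corpus]

    # Cosine similarity between two documents' topic distributions.
    print(matutils.cossim(lda[bow_corpus[0]], lda[bow_corpus[2]]))

For large corpora you wouldn't build the corpus as an in-memory list; Gensim's corpus interfaces are designed to stream documents from disk so memory stays flat.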
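
And an equally rough sketch of handing those topic vectors to scikit-learn. The vectors and labels here are invented placeholders standing in for the output of the LDA step above:

    import numpy as np
    from sklearn.svm import SVC

    # Stand-in for the per-document topic vectors from the LDA sketch;
    # in practice you'd stack doc_vecs into a matrix like this.
    X = np.array([
        [0.90, 0.10],
        [0.20, 0.80],
        [0.85, 0.15],
    ])
    y = np.array([0, 1, 0])  # hypothetical class labels

    # A linear SVM works reasonably well on these low-dimensional,
    # dense topic-space representations.
    clf = SVC(kernel="linear")
    clf.fit(X, y)
    print(clf.predict(X))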