Another good dimensionality reduction technique to consider is Latent Dirichlet Allocation (LDA). I use this approach for natural language and other "bursty" data sets. "Bursty" data sets are characterized by a Zipfian distribution over features, where certain long-tail features become much more likely to be observed again once they have appeared in an instance. For example, "armadillo" is relatively rare overall, but an article that mentions an armadillo once has a high chance of mentioning it again.

A cool thing about LDA is that it lets you express the latent characteristics of a given document as a point in Euclidean space. That gives you the ability to use spatial distance metrics such as cosine distance to express document similarity; I specifically use this for recommending large-scale UGC communities based on their latent characteristics. Furthermore, since you've turned your language data into spatial data, you can use spatial classifiers such as SVMs more effectively over natural language data, which otherwise tends to be better suited to Bayesian classifiers.

I'm a huge fan of Gensim for its LDA implementation. It's even capable of distributed computing using Pyro4, and it's relatively trivial to deploy an LDA pipeline for extremely large datasets using EC2 and the Boto AWS library. A rough sketch of the pipeline is below.

Edit: If you haven't heard of it, scikit-learn is an awesome Python library for high-performance machine learning built on Python's C extensions for numerical computing (scipy, numpy). It's easy to take the topic vectors you get above and train a classifier on them with scikit-learn.
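
Here's a minimal sketch of what the Gensim side can look like. The toy corpus, the whitespace tokenization, and the choice of two topics are all made up for illustration, so treat it as a starting point rather than anything production-ready:

    from gensim import corpora, models, matutils

    # Tiny made-up corpus; in practice you'd stream documents from disk.
    docs = [
        "the armadillo crossed the road and the armadillo kept walking",
        "machine learning on text needs dimensionality reduction",
        "the road was long and the walk was longer",
    ]
    tokenized = [d.lower().split() for d in docs]

    # Map tokens to integer ids and build a bag-of-words corpus.
    dictionary = corpora.Dictionary(tokenized)
    bow_corpus = [dictionary.doc2bow(toks) for toks in tokenized]

    # Train LDA; num_topics is a guess you'd tune for your data.
    # For very large corpora, LdaModel also accepts distributed=True,
    # which farms work out to Pyro4 workers.
    lda = models.LdaModel(bow_corpus, id2word=dictionary,
                          num_topics=2, passes=10)

    # Each document becomes a dense vector in topic space.
    doc_vecs = [matutils.sparse2full(lda[bow], lda.num_topics)
                for bow in bow_corpus]

    # Cosine similarity between two documents' topic distributions.
    print(matutils.cossim(lda[bow_corpus[0]], lda[bow_corpus[2]]))

For large corpora you wouldn't build the corpus as an in-memory list; Gensim's corpus interfaces are designed to stream documents from disk so memory stays flat.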
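
And an equally rough sketch of handing those topic vectors to scikit-learn. The vectors and labels here are invented placeholders standing in for the output of the LDA step above:

    import numpy as np
    from sklearn.svm import SVC

    # Stand-in for the per-document topic vectors from the LDA sketch;
    # in practice you'd stack doc_vecs into a matrix like this.
    X = np.array([
        [0.90, 0.10],
        [0.20, 0.80],
        [0.85, 0.15],
    ])
    y = np.array([0, 1, 0])  # hypothetical class labels

    # A linear SVM works reasonably well on these low-dimensional,
    # dense topic-space representations.
    clf = SVC(kernel="linear")
    clf.fit(X, y)
    print(clf.predict(X))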