The Curse of Dimensionality in Classification

92 points by lucasrp about 11 years ago

7 comments

languagehacker about 11 years ago
Another good dimensionality reduction technique to consider is Latent Dirichlet Allocation. I use this approach for natural language or other "bursty" data sets. "Bursty" data sets are characterized by a Zipfian distribution over features, but with certain long-tail features achieving a higher probability of being observed again once they have been observed in an instance. For example, "armadillo" is relatively rare, but an article that mentions an armadillo once has a high chance of mentioning it again.

A cool thing about LDA is that it allows you to express the latent characteristics of a given document as a point in Euclidean space. This gives you the ability to use spatial distance metrics such as cosine distance to express document similarity. I specifically use this for recommending large-scale UGC communities based on their latent characteristics. Furthermore, since you've turned your language data into spatial data, you're able to use spatial classifiers such as SVMs more effectively over natural language data, which is normally a bit better suited to Bayesian classifiers.

I'm a huge fan of Gensim for its LDA library. It's even capable of distributed computing using Pyro4. It's relatively trivial to deploy an LDA pipeline for extremely large datasets using EC2 and the Boto AWS library.

Edit: If you haven't heard of it, scikit-learn is an awesome Python library for highly performant machine learning, built on Python's C extensions for numerical computing (scipy, numpy). It's easy to take the data you get above and perform learning on it using the classifiers provided.
Comment #7741140 not loaded
Comment #7741060 not loaded
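For readers who want to try the pipeline languagehacker describes, here is a minimal sketch using gensim and scikit-learn. The tiny corpus, the two-topic model, and the training settings are placeholder assumptions for illustration, not anything taken from the comment.

```python
# Minimal LDA-to-vector-space sketch: fit topics, express each document as a
# point in topic space, then compare documents with cosine similarity.
from gensim import corpora, models, matutils
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus (real use would involve proper tokenization and stopword removal).
docs = [
    "armadillo spotted near the armadillo burrow last night",
    "stock market rallies as tech shares climb again",
    "researchers tag an armadillo for a long field study",
]
tokenized = [d.lower().split() for d in docs]

dictionary = corpora.Dictionary(tokenized)                      # token <-> id mapping
bow_corpus = [dictionary.doc2bow(toks) for toks in tokenized]   # sparse bag-of-words vectors

# num_topics=2 is an arbitrary choice for this toy example.
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)

# Each document becomes a dense point in topic space (the dimensionality reduction step).
doc_vectors = matutils.corpus2dense(lda[bow_corpus], num_terms=2).T  # shape (n_docs, n_topics)

# Spatial similarity between documents, as the comment suggests.
print(cosine_similarity(doc_vectors))
```

From here the dense document vectors can be fed to any scikit-learn classifier (e.g. an SVM) in the usual fit/predict way.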
vonnik about 11 years ago
Neural nets are a great way to reduce dimensionality, in particular deep autoencoders. Here's an old Hinton paper on it: http://www.cs.toronto.edu/~hinton/science.pdf
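For context, an autoencoder compresses inputs through a narrow bottleneck layer and is trained to reconstruct them, so the bottleneck activations serve as the reduced representation. Below is a minimal PyTorch sketch; the layer sizes, the 784-dimensional input, and the training loop are illustrative assumptions rather than the setup from the linked paper.

```python
# Minimal autoencoder sketch: the 30-dimensional bottleneck is the reduced representation.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features=784, n_latent=30):
        super().__init__()
        # Encoder squeezes the input down to a low-dimensional code...
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, n_latent),
        )
        # ...and the decoder tries to rebuild the original from that code.
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 256), nn.ReLU(),
            nn.Linear(256, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)            # placeholder batch standing in for real data
for _ in range(10):                # a few reconstruction steps
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)    # reconstruction error drives the compression
    loss.backward()
    optimizer.step()

codes = model.encoder(x)           # each sample reduced to 30 dimensions
```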
alceufc about 11 years ago
I really like the idea of the website as a whole: explaining concepts from computer vision in a simple way.

When I was starting my master's course I was interested in learning what the concept of bag of words in computer vision was all about. Although it is a straightforward technique, there are few examples on the Web explaining how to implement it (clustering the feature vectors, etc.).
Comment #7742192 not loaded
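Since the comment notes how thin the implementation examples are, here is a rough bag-of-visual-words sketch: cluster local descriptors into a visual vocabulary with k-means, then represent each image as a histogram of word assignments. The random arrays stand in for real SIFT/ORB descriptors, and the vocabulary size is an arbitrary choice.

```python
# Bag-of-visual-words sketch: learn a visual vocabulary by clustering descriptors,
# then encode each image as a normalized histogram over that vocabulary.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder: one array of 128-dimensional "descriptors" per image.
descriptors_per_image = [rng.random((int(rng.integers(50, 200)), 128)) for _ in range(5)]

n_words = 10  # vocabulary size (illustrative)
kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=0)
kmeans.fit(np.vstack(descriptors_per_image))   # learn visual words from all descriptors

def bow_histogram(descriptors):
    words = kmeans.predict(descriptors)                        # nearest visual word per descriptor
    hist, _ = np.histogram(words, bins=np.arange(n_words + 1))
    return hist / hist.sum()                                   # normalized word frequencies

image_vectors = np.array([bow_histogram(d) for d in descriptors_per_image])
print(image_vectors.shape)  # (5, 10): one fixed-length vector per image
```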
Malarkey73 about 11 years ago
I'm familiar with this idea, but it's nice to see it explained with cute little puppies and kittens.
therobot24 about 11 years ago
One counter-example: face recognition using 100k features (http://research.microsoft.com/pubs/192106/HighDimFeature.pdf).
Comment #7740674 not loaded
nraynaud about 11 years ago
Is there some kind of test to know if we are past the optimal number of dimensions? I guess overfitting could be detected by the ratio between the volume and the area of the classification boundary.
Comment #7741026 not loaded
Comment #7741539 not loaded
Comment #7741897 not loaded
Comment #7740983 not loaded
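One common empirical check, sketched below, is to treat the number of dimensions as a quantity to validate: score a classifier with cross-validation as features are added and look for the point where held-out accuracy peaks and then starts to fall. The synthetic data and the k-NN classifier here are arbitrary choices for illustration.

```python
# Probe for "too many dimensions" empirically: cross-validated accuracy should
# peak once the informative features are in and decay as noise dimensions pile on.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 5 informative features followed by pure noise (shuffle=False keeps that order).
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)  # distance-based, hence sensitive to dimensionality
for d in (2, 5, 10, 20, 50):
    score = cross_val_score(clf, X[:, :d], y, cv=5).mean()
    print(f"{d:3d} dimensions -> mean CV accuracy {score:.3f}")
```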
jbert about 11 years ago
Can someone explain Figure 6 to me, please? How does projecting the 3D space (and the 2D green plane) lead to the regions around the cats' heads?
Comment #7742980 not loaded