
The Curse of Dimensionality in Classification

92 points by lucasrp about 11 years ago

7 comments

languagehacker about 11 years ago
Another good dimensionality reduction technique to consider is Latent Dirichlet Allocation. I use this approach for natural language or other "bursty" data sets. "Bursty" data sets are characterized by having a Zipfian distribution over features, but with certain long-tail features achieving a higher probability of being observed again once they appear in an instance. For example, "armadillo" is relatively rare, but an article mentioning an armadillo once has a high chance of mentioning it again.

A cool thing about LDA is that it allows you to express the latent characteristics of a given document as a point in Euclidean space. This gives you the ability to use spatial distance metrics such as cosine distance to express document similarity. I specifically use this for recommending large-scale UGC communities based on their latent characteristics. Furthermore, since you've turned your language data into spatial data, you're able to use spatial classifiers such as SVMs more effectively over natural language data, which is normally a bit better suited to Bayesian classifiers.

I'm a huge fan of Gensim for its LDA library. It's even capable of distributed computing using Pyro4. It's relatively trivial to deploy an LDA pipeline for extremely large datasets using EC2 and the Boto AWS library.

Edit: If you haven't heard of it, scikit-learn is an awesome Python library for highly performant machine learning using Python's C extensions for numerical computing (scipy, numpy). It's easy to take the data you get above and perform learning on it using the classifiers provided.
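A minimal sketch of the kind of Gensim pipeline described above: fit LDA on a bag-of-words corpus, then compare documents by cosine similarity in topic space. The toy corpus and topic count are made up for illustration.

```python
from gensim import corpora, models, similarities

# Toy corpus; in practice these would be tokenized documents.
docs = [
    ["armadillo", "roadside", "armadillo", "texas"],
    ["neural", "network", "training", "gpu"],
    ["armadillo", "wildlife", "habitat"],
]

# Build the vocabulary and the bag-of-words representation.
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit LDA; num_topics=2 is an arbitrary choice for this toy example.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# Each document is now a point in topic space; index them for cosine similarity.
index = similarities.MatrixSimilarity(lda[corpus], num_features=2)

# Cosine similarity of the first document against all documents.
sims = index[lda[corpus[0]]]
print(list(enumerate(sims)))
```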
Comment #7741140 not loaded
Comment #7741060 not loaded
vonnik about 11 years ago
Neural nets are a great way to reduce dimensionality, in particular deep autoencoders. Here's an old Hinton paper on it: http://www.cs.toronto.edu/~hinton/science.pdf
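For reference, a minimal deep-autoencoder sketch in PyTorch; the layer sizes, data, and training loop are illustrative placeholders, not taken from the paper. The bottleneck activations are the reduced-dimensionality representation.

```python
import torch
import torch.nn as nn

# Encode 784-dimensional inputs down to 2 dimensions, then decode back.
encoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32), nn.ReLU(),
    nn.Linear(32, 2),
)
decoder = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 784),
)
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)  # stand-in for a real batch of data
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)  # train the network to reconstruct its input
    loss.backward()
    optimizer.step()

# The 2-D codes are the dimensionality-reduced features.
codes = encoder(x).detach()
```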
alceufc about 11 years ago
I really like the idea of the Web site as a whole: explaining concepts from computer vision in a simple way.

When I was starting my master's course I was interested in learning what the concept of bag of words in computer vision was all about. Although it is a straightforward technique, there are few examples on the Web explaining how to implement it (clustering the feature vectors, etc.).
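A minimal sketch of the bag-of-visual-words recipe mentioned above, using scikit-learn's KMeans. The random arrays stand in for real local descriptors (e.g. SIFT), and the vocabulary size is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder descriptors: each "image" gets a variable number of 128-dim vectors.
rng = np.random.default_rng(0)
descriptors_per_image = [rng.normal(size=(rng.integers(50, 200), 128)) for _ in range(10)]

# 1. Cluster all descriptors to build the visual vocabulary ("codebook").
n_words = 50  # vocabulary size, chosen arbitrarily here
all_descriptors = np.vstack(descriptors_per_image)
codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(all_descriptors)

# 2. Represent each image as a normalized histogram of visual-word assignments.
def bag_of_words(descriptors):
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / hist.sum()  # normalize so images with different descriptor counts compare

features = np.array([bag_of_words(d) for d in descriptors_per_image])
print(features.shape)  # (10, 50): one fixed-length vector per image
```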
Comment #7742192 not loaded
Malarkey73 about 11 years ago
I'm familiar with this idea, but it's nice to see it explained with cute little puppies and kittens.
therobot24 about 11 years ago
One counter-example: face recognition using 100k features (http://research.microsoft.com/pubs/192106/HighDimFeature.pdf)
Comment #7740674 not loaded
nraynaud about 11 years ago
Is there some kind of test to know whether we are past the optimal number of dimensions? I guess overfitting could be detected by the ratio between the volume and the area of the classification boundary.
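One common empirical check, sketched here purely as an illustration (the dataset and classifier are made up): compare training accuracy with cross-validated accuracy as features are added. Once past the useful dimensionality, the training score keeps climbing while the cross-validated score stalls or drops.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data: only the first 10 of 500 features are informative (shuffle=False
# keeps the informative features in the leading columns).
X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           n_redundant=0, shuffle=False, random_state=0)

for n_dims in (5, 10, 50, 200, 500):
    clf = SVC(kernel="linear")
    subset = X[:, :n_dims]                       # keep the first n_dims features
    train_acc = clf.fit(subset, y).score(subset, y)
    cv_acc = cross_val_score(clf, subset, y, cv=5).mean()
    print(f"{n_dims:>3} dims  train={train_acc:.2f}  cv={cv_acc:.2f}")
```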
Comment #7741026 not loaded
Comment #7741539 not loaded
Comment #7741897 not loaded
Comment #7740983 not loaded
jbert about 11 years ago
Can someone explain Figure 6 to me, please? How does projecting the 3D space (and the 2D green plane) lead to the regions around the cats' heads?
Comment #7742980 not loaded