Comparing Clustering Algorithms

178 points by merusame, about 9 years ago

14 comments

mattnedrich, about 9 years ago
I wrote an article about mean-shift a while back if anyone is interested in more details about it - https://spin.atomicobject.com/2015/05/26/mean-shift-clustering/

Some comments on K-Means - one large limitation of K-Means is that it assumes spherical clusters. It will fail terribly for any other cluster shape.

It's interesting that the author compared results on the same data set for the different algorithms. Each clustering approach is going to work best on a specific type of data set. It would be interesting to compare them across several different data sets to get a better feel for strengths/weaknesses, etc.
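A minimal sketch of the K-Means point above, assuming a made-up toy data set of two long, thin stripes (not the article's data). It does not run K-Means itself; it only evaluates the K-Means objective (within-cluster sum of squared distances) for two candidate partitions, to show that the objective prefers compact, roughly round groups over the elongated "true" clusters.

```python
def sse(clusters):
    """K-Means objective: sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for pts in clusters:
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in pts)
    return total

# Hypothetical toy data: the "true" clusters are the stripes y = 0 and y = 1.
stripe_a = [(float(x), 0.0) for x in range(11)]
stripe_b = [(float(x), 1.0) for x in range(11)]
points = stripe_a + stripe_b

# Partition 1: the two stripes (what a human would pick).
by_stripe = [stripe_a, stripe_b]
# Partition 2: cut the data into a left half and a right half.
by_half = [[p for p in points if p[0] <= 5], [p for p in points if p[0] > 5]]

sse_by_stripe = sse(by_stripe)
sse_by_half = sse(by_half)
print("stripes:", sse_by_stripe, "halves:", sse_by_half)
```

The left/right cut scores lower than the stripe partition, so an optimizer that minimizes this objective has no reason to recover the stripes.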
jey, about 9 years ago
This is very specific to 2D data. I bet the story is a lot different for high-dimensional data. The challenges you encounter with this sort of clumping in 2D are unlikely to occur in high-dimensional data due to the "curse" of dimensionality. Clustering in high dimensions has its own quirks and gotchas too, but they're quite distinct from the gotchas of low dimensionality.

Here's more from an actual expert: http://research.microsoft.com/en-US/people/kannan/book-chapter-2.pdf
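One commonly cited face of the "curse" is that distances concentrate: in high dimensions the nearest and farthest neighbours of a point end up almost equally far away. The sketch below is only an illustration under assumed conditions (uniform random points in the unit cube; the function name `relative_contrast` and the sizes are made up here), not a claim about the article's data.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query to n random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(2)     # 2D: nearest neighbour is far closer than the farthest
high = relative_contrast(500)  # 500D: all neighbours sit at nearly the same distance
print("2D contrast:", low, "500D contrast:", high)
```

When the contrast is small, density- and distance-based notions of "cluster" have much less to work with, which is one reason 2D intuitions may not transfer.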
merusame, about 9 years ago
One of the better comparisons of clustering algorithms online. It's also worth checking out how HDBSCAN works: http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb
makmanalp, about 9 years ago
These all look like clustering based on a 2D space, but does anyone know methods to tackle clustering on a network?

Is it just a matter of tweaking the definition of density / distance to the number of hops, or is it a different problem entirely? I can see how with 0 or 1 hops the data would be a very smushed distribution, whereas 2D distance is much richer and more spread out.
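To make the "number of hops" idea concrete, here is a small sketch that computes hop-count distances with breadth-first search on a hypothetical six-node graph (the graph and the name `hop_distances` are made up for illustration). It only shows what such a distance looks like; it is not a network clustering method in itself.

```python
from collections import deque

def hop_distances(graph, source):
    """Breadth-first search: number of hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

# Hypothetical graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}

d = hop_distances(graph, "a")
print(d)
```

Every distance is a small integer, so many node pairs tie, which is the "smushed" distribution described above compared with continuous 2D distances.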
Xcelerate, about 9 years ago
I would say there are two aspects of clustering that are important: accuracy of the model, and accuracy of the model fitting process.

A model is a guess about the underlying process that "generates" the data. If you're trying to use hyperplanes to divide data that lies on a manifold, then you are going to have poor results no matter how good your fitting algorithm is.

On the other hand, even if you know the true model, high levels of noise can prevent you from recovering the correct parameters. For instance, Max-Cut is NP-hard, and the best we can do is a semidefinite programming approximation. Beyond a certain noise threshold, the gap between the SDP solution and the true solution becomes very large very quickly.
bmordue, about 9 years ago
This is a great resource. As someone who knows little about clustering, this was clear and very informative. It covers potential pitfalls much better than other similar documents I've seen, which is a useful approach.
kasperset, about 9 years ago
How about t-SNE for clustering? https://lvdmaaten.github.io/tsne/
popra, about 9 years ago
As someone interested in the subject but with zero insight or knowledge, would these algorithms be a good match for short text clustering? For example, identifying identical products in a price comparison app based on their similar but not identical title/name and possibly other attributes.
Wei-1, about 9 years ago
Why not use density clustering based on this 2014 Science paper? http://science.sciencemag.org/content/344/6191/1492
leecarraher, about 9 years ago
It would be interesting to see which agglomerative clustering the author is using here. I suspect that for this two-dimensional, density-based cluster dataset, single-link agglomerative would perform much better than what is shown (likely average link).
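For readers unfamiliar with the term, here is a minimal pure-Python sketch of single-link agglomerative clustering, run on a made-up data set of two elongated stripes rather than the article's data. It repeatedly merges the two clusters whose closest pair of points is nearest, which is equivalent to adding edges in order of increasing length; the function name `single_link` and the union-find bookkeeping are choices made for this illustration.

```python
import math

def single_link(points, k):
    """Single-linkage agglomerative clustering, stopped when k clusters remain."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    clusters = len(points)
    for _, i, j in edges:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different clusters: merge them
            parent[ri] = rj
            clusters -= 1
    return [find(i) for i in range(len(points))]

# Hypothetical toy data: two long stripes, 1 unit between neighbours, 3 units apart.
stripe_a = [(float(x), 0.0) for x in range(11)]
stripe_b = [(float(x), 3.0) for x in range(11)]
labels = single_link(stripe_a + stripe_b, k=2)
print(labels)
```

Because single link only looks at the closest pair between clusters, it chains along each stripe before ever bridging the gap, so it can follow non-spherical shapes; the same chaining also makes it sensitive to noise points that bridge clusters.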
amelius, about 9 years ago
Interesting, but the subtitle "Why you should use HDBSCAN" makes little sense on a dataset of N=1.
chestervonwinch, about 9 years ago
Cool. I have not looked into the *DBSCAN methods yet. This post makes me think I should.
dweinus, about 9 years ago
Great article! Does anyone know of an implementation of HDBSCAN for R?
graycat, about 9 years ago
You can find such comparisons by Glenn W. Milligan (Ohio State) going back to at least 1980.