
Comparing Clustering Algorithms

178 points by merusame, about 9 years ago

14 comments

mattnedrich, about 9 years ago

I wrote an article about mean-shift a while back if anyone is interested in more detail: https://spin.atomicobject.com/2015/05/26/mean-shift-clustering/

Some comments on K-Means: one large limitation is that it assumes roughly spherical clusters. It will fail terribly on any other cluster shape.

It's interesting that the author compared results on the same data set for the different algorithms. Each clustering approach is going to work best on a specific type of data set, so it would be interesting to compare them across several different data sets to get a better feel for their strengths and weaknesses.
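That failure mode is easy to reproduce. A quick sketch using scikit-learn's two-moons toy data (my own example, not from the article):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clearly two clusters, but not spherical.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# K-Means models clusters as balls around centroids, so it slices the
# moons with a roughly straight boundary instead of following their shape.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

ari = adjusted_rand_score(y_true, labels)
print(f"ARI vs. true moons: {ari:.2f}")  # well below 1.0
```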
jey, about 9 years ago

This is very specific to 2D data. I bet the story is a lot different for high-dimensional data. The clumping challenges you encounter in 2D are unlikely to occur in high-dimensional data, due to the "curse" of dimensionality. Clustering in high dimensions has its own quirks and gotchas too, but they're quite distinct from the gotchas of low dimensions.

Here's more from an actual expert: http://research.microsoft.com/en-US/people/kannan/book-chapter-2.pdf
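A small illustration of the concentration effect behind that (my own sketch, not from the linked chapter): as dimension grows, pairwise distances between random points become nearly indistinguishable, which undermines the low-dimensional notion of "clumps".

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n=200):
    """Relative spread (max - min) / mean of pairwise distances
    between n uniform random points in the unit cube of dimension dim."""
    pts = rng.random((n, dim))
    # Pairwise Euclidean distances via broadcasting.
    diff = pts[:, None, :] - pts[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    d = d[np.triu_indices(n, k=1)]
    return (d.max() - d.min()) / d.mean()

low, high = distance_spread(2), distance_spread(1000)
print(f"2D spread:    {low:.2f}")
print(f"1000D spread: {high:.2f}")  # much smaller: distances concentrate
```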
merusame, about 9 years ago

One of the better comparisons of clustering algorithms online. It's also worth checking out how HDBSCAN works: http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb
makmanalp, about 9 years ago

These all look like clustering in a 2D space, but does anyone know of methods to tackle clustering on a network?

Is it just a matter of tweaking the definition of density / distance to the number of hops, or is it a different problem entirely? I can see how with 0 or 1 hops the data would be a very smushed distribution, whereas 2D distance is much richer and more spread out.
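One common answer (my sketch, not from the thread) is to treat this as community detection and optimize modularity over the edge structure rather than a geometric density, e.g. with networkx:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Zachary's karate club: a small social network with two known factions.
G = nx.karate_club_graph()

# Greedy modularity maximization groups nodes so edges are dense within
# communities and sparse between them -- no coordinates or hop-distance
# metric needed, only the adjacency structure.
communities = greedy_modularity_communities(G)
print([sorted(c)[:5] for c in communities])  # first few members of each
```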
Xcelerate, about 9 years ago

I would say there are two aspects of clustering that are important: accuracy of the model, and accuracy of the model-fitting process.

A model is a guess about the underlying process that "generates" the data. If you're trying to use hyperplanes to divide data that lies on a manifold, then you are going to have poor results no matter how good your fitting algorithm is.

On the other hand, even if you know the true model, high levels of noise can prevent you from recovering the correct parameters. For instance, Max-Cut is NP-hard, and the best we can do is a semidefinite programming approximation. Beyond a certain noise threshold, the gap between the SDP solution and the true solution becomes very large very quickly.
bmordue, about 9 years ago

This is a great resource. As someone who knows little about clustering, I found it clear and very informative. It covers potential pitfalls much better than other similar documents I've seen, which is a useful approach.
kasperset, about 9 years ago

How about t-SNE for clustering? https://lvdmaaten.github.io/tsne/
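Strictly speaking t-SNE is an embedding/visualization method, not a clustering algorithm: it assigns coordinates, not labels. It is sometimes used as a preprocessing step before an actual clusterer, along these lines (my sketch, not endorsed by the linked page):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

# Three well-separated blobs in 10 dimensions.
X, y_true = make_blobs(n_samples=90, n_features=10, centers=3,
                       cluster_std=0.5, random_state=0)

# t-SNE embeds to 2D; it does not produce cluster labels itself.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# A separate clustering step on the embedding does the actual grouping.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
print(f"ARI: {adjusted_rand_score(y_true, labels):.2f}")
```

One caveat with this pipeline: t-SNE distorts densities and between-cluster distances, so apparent clusters in the embedding should be sanity-checked against the original space.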
popra, about 9 years ago

As someone interested in the subject but with zero insight or knowledge: would these algorithms be a good match for short-text clustering? For example, identifying identical products in a price-comparison app based on their similar but not identical title/name and possibly other attributes.
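A common starting point for that kind of fuzzy title matching (my sketch, not from the thread; the product titles are hypothetical) is to vectorize titles with character n-gram TF-IDF and compare cosine similarities, then threshold or cluster on those:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    "Apple iPhone 6s 64GB Space Gray",     # hypothetical listings
    "iPhone 6s 64 GB Space Grey (Apple)",  # same product, different title
    "Samsung Galaxy S7 32GB Black",        # different product
]

# Character n-grams are robust to word order, small typos, and spacing.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
sim = cosine_similarity(vec.fit_transform(titles))

print(f"same product:      {sim[0, 1]:.2f}")
print(f"different product: {sim[0, 2]:.2f}")
```

On top of a similarity matrix like this, a density-based clusterer such as DBSCAN (with a precomputed distance of 1 - similarity) can group duplicate listings without fixing the number of products in advance.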
Wei-1, about 9 years ago

Why not use density clustering based on this 2014 Science paper? http://science.sciencemag.org/content/344/6191/1492
leecarraher, about 9 years ago

It would be interesting to see which agglomerative clustering variant the author is using here. I suspect that for this two-dimensional, density-based cluster dataset, single-link agglomerative clustering would perform much better than what is shown (likely average link).
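That linkage difference is easy to reproduce on a density-shaped toy set (my sketch, using scikit-learn's two-moons data rather than the article's dataset):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

scores = {}
for linkage in ("single", "average"):
    labels = AgglomerativeClustering(
        n_clusters=2, linkage=linkage).fit_predict(X)
    scores[linkage] = adjusted_rand_score(y_true, labels)

# Single link chains along dense curves; average link prefers compact blobs.
print(scores)
```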
amelius, about 9 years ago

Interesting, but the subtitle "Why you should use HDBSCAN" makes little sense on a dataset of N=1.
chestervonwinch, about 9 years ago
Cool. I have not looked into the *DBSCAN methods, yet. This post makes me think I should.
dweinus, about 9 years ago
Great article! Does anyone know of an implementation of HDBSCAN for R?
graycat, about 9 years ago

Can find such comparisons by Glenn W. Milligan, Ohio State, back to at least 1980.