I wrote an article about mean-shift a while back if anyone is interested in more details about it - <a href="https://spin.atomicobject.com/2015/05/26/mean-shift-clustering/" rel="nofollow">https://spin.atomicobject.com/2015/05/26/mean-shift-clusteri...</a><p>Some comments on K-Means - one large limitation of K-Means is that it assumes roughly spherical clusters of similar size. It can fail badly on any other cluster shape.<p>It's interesting that the author compared results on the same data set for the different algorithms. Each clustering approach is going to work best on a specific type of data set, so it would be interesting to compare them across several different data sets to get a better feel for their respective strengths and weaknesses.
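To make the spherical-cluster limitation concrete, here's a small sketch (my own toy example, not from the linked article): on two concentric rings, K-Means draws a straight cut through both rings, while a density-based method like DBSCAN recovers them. Dataset and parameters are illustrative.

```python
# K-Means assumes roughly spherical clusters; two concentric rings are
# about as non-spherical as it gets, so K-Means splits them with a
# straight cut. DBSCAN, which follows density, recovers the rings.
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# adjusted Rand index vs. the true ring labels: ~0 means a useless
# partition, ~1 means the rings were recovered
ari_km = adjusted_rand_score(y, km)
ari_db = adjusted_rand_score(y, db)
```

On this data DBSCAN's score should sit near 1 while K-Means lands near 0, which is the failure mode the comment above is describing.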
This is very specific to 2D data. I bet the story is a lot different for high-dimensional data. The challenges you encounter with this sort of clumping in 2D are unlikely to occur in high-dimensional data due to the "curse" of dimensionality. Clustering in high dimensions has its own quirks and gotchas too, but they're quite distinct from the gotchas of low dimensionality.<p>Here's more from an actual expert: <a href="http://research.microsoft.com/en-US/people/kannan/book-chapter-2.pdf" rel="nofollow">http://research.microsoft.com/en-US/people/kannan/book-chapt...</a>
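One face of that curse is easy to demonstrate: distances concentrate. A quick sketch (numbers are illustrative, not from the linked chapter) showing that for random points, the ratio between the farthest and nearest distance collapses toward 1 as dimensionality grows, which undermines the density contrasts that 2D clustering relies on:

```python
# Distance concentration: in high dimensions, all pairwise distances
# from a point look nearly the same, so "near" vs "far" stops meaning much.
import numpy as np

rng = np.random.default_rng(0)

def contrast(dim, n=200):
    """Ratio of farthest to nearest distance from one random point."""
    X = rng.random((n, dim))
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return d.max() / d.min()

low = contrast(2)      # 2D: distances vary a lot
high = contrast(1000)  # 1000D: max/min ratio is close to 1
```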
One of the better comparisons of clustering algorithms online. It's also worth checking out how HDBSCAN works: <a href="http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb" rel="nofollow">http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/mas...</a>
These all look like clustering in a 2D space, but does anyone know methods for tackling clustering on a network?<p>Is it just a matter of redefining density/distance in terms of the number of hops, or is it a different problem entirely? I can see how, with only 0 or 1 hops available, the distances would be very smushed together, whereas 2D distance is much richer and more spread out.
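One standard answer is to work with the graph directly rather than forcing hop counts into a Euclidean mold: spectral clustering partitions using eigenvectors of the graph Laplacian built from the adjacency matrix, and modularity-based community detection (e.g. Louvain) is another whole family. A minimal sketch with a made-up toy graph of two cliques joined by a single bridge edge:

```python
# Spectral clustering on a graph: feed the adjacency matrix in directly
# as a precomputed affinity, no 2D coordinates needed.
import numpy as np
from sklearn.cluster import SpectralClustering

n = 10
A = np.zeros((n, n))
A[:5, :5] = 1            # clique on nodes 0-4
A[5:, 5:] = 1            # clique on nodes 5-9
np.fill_diagonal(A, 0)   # no self-loops
A[4, 5] = A[5, 4] = 1    # single bridge edge between the cliques

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(A)
```

The Laplacian's eigenvectors "see" the bottleneck edge, so the two cliques come out as the two clusters even though no node has coordinates.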
I would say there are two aspects of clustering that are important: accuracy of the model, and accuracy of the model fitting process.<p>A model is a guess about the underlying process that "generates" the data. If you're trying to use hyperplanes to divide data that lies on a manifold, then you are going to have poor results no matter how good your fitting algorithm is.<p>On the other hand, even if you know the true model, high levels of noise can prevent you from recovering the correct parameters. For instance, Max-Cut is NP-hard, and the best we can do is a semidefinite programming approximation. Beyond a certain noise threshold, the gap between the SDP solution and the true solution becomes very large very quickly.
This is a great resource. As someone who knows little about clustering, this was clear and very informative. It covers potential pitfalls much better than other similar documents I've seen, which is a useful approach.
How about t-SNE for clustering? <a href="https://lvdmaaten.github.io/tsne/" rel="nofollow">https://lvdmaaten.github.io/tsne/</a>
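One caveat: t-SNE is a visualization/embedding method rather than a clustering algorithm, but a common (if somewhat debated) pattern is to embed to 2D first and then run a clusterer on the embedding. A sketch with scikit-learn's implementation and illustrative parameters; note that t-SNE does not preserve distances or densities faithfully, so treat the downstream clusters with care:

```python
# Embed high-dimensional data with t-SNE, then cluster the 2D embedding.
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# 10-dimensional blobs that can't be eyeballed directly
X, y = make_blobs(n_samples=150, centers=3, n_features=10,
                  cluster_std=1.0, random_state=0)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)

# sanity check against the known generating labels
ari = adjusted_rand_score(y, labels)
```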
As someone interested in the subject but with zero insight or knowledge, would these algorithms be a good match for short-text clustering? For example, identifying identical products in a price-comparison app based on their similar-but-not-identical title/name and possibly other attributes.
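They can be, once the titles are turned into vectors. One common baseline is TF-IDF over character n-grams (robust to spelling and word-order differences) plus a density clusterer on cosine distance, so near-duplicate titles clump together and one-off titles stay unmatched. A sketch with made-up titles and an illustrative threshold, not a production recipe:

```python
# Cluster near-duplicate product titles: char n-gram TF-IDF + DBSCAN.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

titles = [
    "Apple iPhone 13 128GB Black",
    "iPhone 13 (128 GB) - Black - Apple",
    "Samsung Galaxy S21 5G 128GB",
    "Galaxy S21 128GB 5G Samsung",
    "Bosch cordless drill 18V",
]

# character 3-4 grams within word boundaries tolerate reordering/typos
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
X = vec.fit_transform(titles)

# eps is the "how similar counts as the same product" knob;
# min_samples=2 means a title with no close match stays noise (-1)
labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(X)
```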
Why not use density-peaks clustering, based on this 2014 Science paper?
<a href="http://science.sciencemag.org/content/344/6191/1492" rel="nofollow">http://science.sciencemag.org/content/344/6191/1492</a>
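For anyone curious, that paper's method ("Clustering by fast search and find of density peaks", Rodriguez &amp; Laio 2014) is short enough to sketch. This is my own minimal toy implementation, with illustrative choices for the cutoff d_c and the number of centers: each point gets a local density rho and a delta (distance to the nearest denser point); points where both are large are the cluster centers, and everything else inherits the label of its nearest denser neighbor.

```python
# Minimal density-peaks clustering sketch on two well-separated blobs.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=200, centers=[(0, 0), (6, 6)],
                  cluster_std=0.8, random_state=0)
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)

d_c = 1.0                                # density cutoff (illustrative)
rho = (D < d_c).sum(axis=1) - 1          # neighbors within d_c

order = np.argsort(-rho)                 # points by decreasing density
delta = np.zeros(len(X))
nearest_denser = np.arange(len(X))
delta[order[0]] = D[order[0]].max()      # global peak: delta = max distance
for k in range(1, len(X)):
    i = order[k]
    prev = order[:k]                     # points at least as dense as i
    j = prev[np.argmin(D[i, prev])]
    delta[i] = D[i, j]
    nearest_denser[i] = j

# centers = the 2 points with the largest rho * delta (2 is assumed here;
# the paper picks them by eye from the rho/delta "decision graph")
centers = np.argsort(rho * delta)[-2:]
labels = np.full(len(X), -1)
labels[centers] = [0, 1]
for i in order:                          # decreasing density, so the
    if labels[i] == -1:                  # denser neighbor is already labeled
        labels[i] = labels[nearest_denser[i]]

ari = adjusted_rand_score(y, labels)
```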
It would be interesting to see which agglomerative clustering variant the author is using here. I suspect that for this two-dimensional, density-based dataset, single-link agglomerative clustering would perform much better than what is shown (likely average-link).
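That suspicion is easy to check, since scikit-learn's AgglomerativeClustering exposes the linkage choice directly. A sketch on a density-chained dataset (two moons; the data and parameters are illustrative, not the author's):

```python
# Single-link vs average-link agglomerative clustering on two moons.
from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

single = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
average = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(X)

# single-link chains along each moon; average-link tends to cut across them
ari_single = adjusted_rand_score(y, single)
ari_average = adjusted_rand_score(y, average)
```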