科技回声

14 comments

mattnedrich, about 9 years ago

I wrote an article about mean-shift a while back if anyone is interested in more details about it - <a href="https://spin.atomicobject.com/2015/05/26/mean-shift-clustering/" rel="nofollow">https://spin.atomicobject.com/2015/05/26/mean-shift-clusteri...</a>

Some comments on K-Means - one large limitation of K-Means is that it assumes spherical shaped clusters. It will fail terribly for any other cluster shape.

It's interesting that the author compared results on the same data set for the different algorithms. Each clustering approach is going to work best on a specific type of data set. It would be interesting to compare them across several different data sets to get a better feel for strengths/weaknesses, etc.
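
To make the spherical-cluster point concrete, here is a minimal sketch in plain Python. It is not the article's code or any library's implementation: the `kmeans` helper, the two-strip toy data, and the starting centers are all assumptions made up for illustration. Even with one starting center placed in each strip, Lloyd's iterations settle on a left/right split that cuts both strips in half.

```python
import math

def kmeans(points, centers, iterations=20):
    """Plain Lloyd's algorithm: assign to nearest center, then recompute means."""
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:  # keep an empty cluster's center where it was
                new_centers.append(centers[c])
                continue
            new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
        if new_centers == centers:  # assignments stopped changing
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two long, thin horizontal strips. Clearly two groups,
# but neither is remotely spherical.
strip_a = [(0.5 * i, 0.0) for i in range(21)]
strip_b = [(0.5 * i, 1.0) for i in range(21)]
points = strip_a + strip_b

# Seed one center in each strip, which is a generous starting point.
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 1.0)])

labels_a = labels[:21]
labels_b = labels[21:]
print("strip A labels:", labels_a)
print("strip B labels:", labels_b)
```

Both strips end up containing both labels, because splitting along the long axis reduces within-cluster variance more than separating the strips does.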

jey, about 9 years ago

This is very specific to 2D data. I bet the story is a lot different for high-dimensional data. The challenges you encounter with this sort of clumping in 2D are unlikely to occur in high-dimensional data due to the "curse" of dimensionality. Clustering in high dimensions has its own quirks and gotchas too, but they're quite distinct from the gotchas of low dimensionality.

Here's more from an actual expert: <a href="http://research.microsoft.com/en-US/people/kannan/book-chapter-2.pdf" rel="nofollow">http://research.microsoft.com/en-US/people/kannan/book-chapt...</a>
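
One facet of the "curse" that is easy to try yourself is distance concentration: in high dimensions, the nearest and farthest points from a query end up at nearly the same distance. The sketch below is only an illustration under assumed conditions (uniform random points in a unit cube, a made-up `relative_contrast` helper, arbitrary sample size and seed). It is not taken from the linked chapter.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point
    to a uniform random sample in the unit cube of the given dimension."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the dimension grows.
for dim in (2, 20, 200):
    print(dim, round(relative_contrast(dim), 3))
```

When that contrast is small, notions like "dense region" and "nearest neighbour" carry much less information, which is one reason 2D intuition about clumping does not transfer directly.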

merusame, about 9 years ago

One of the better comparisons of clustering algorithms online. It's also worth checking out how HDBSCAN works: <a href="http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb" rel="nofollow">http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/mas...</a>

makmanalp, about 9 years ago

These all look like clustering based on a 2D space, but does anyone know methods to tackle clustering on a network?

Is it just a matter of tweaking the definition of density / distance to the number of hops, or is it a different problem entirely? I can see how with 0 or 1 hops the data would be a very smushed distribution, whereas 2D distance is much richer and more spread out.
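
The "smushed distribution" worry is easy to see on a small example. The sketch below computes hop counts with breadth-first search on a made-up six-node network (two triangles joined by one bridge edge). The graph and the `hop_distances` helper are assumptions for illustration, and the sketch does not answer whether density-based methods work well on top of this distance.

```python
from collections import Counter, deque

def hop_distances(graph, source):
    """Breadth-first search: number of edges on a shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

# Hypothetical toy network: two triangles joined by a single bridge edge c-d.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}

all_pairs = {u: hop_distances(graph, u) for u in graph}
# Count how often each hop distance occurs over all unordered node pairs.
histogram = Counter(all_pairs[u][v] for u in graph for v in graph if u < v)
print(sorted(histogram.items()))  # [(1, 7), (2, 4), (3, 4)]
```

All 15 pairs fall on just three distance values, so any density or neighbourhood threshold has very few meaningful settings compared with a continuous 2D distance.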

Xcelerate, about 9 years ago

I would say there are two aspects of clustering that are important: accuracy of the model, and accuracy of the model fitting process.

A model is a guess about the underlying process that "generates" the data. If you're trying to use hyperplanes to divide data that lies on a manifold, then you are going to have poor results no matter how good your fitting algorithm is.

On the other hand, even if you know the true model, high levels of noise can prevent you from recovering the correct parameters. For instance, Max-Cut is NP-hard, and the best we can do is a semidefinite programming approximation. Beyond a certain noise threshold, the gap between the SDP solution and the true solution becomes very large very quickly.
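
For readers who have not met Max-Cut, here is a small exhaustive baseline. To be clear, this is not the SDP relaxation mentioned above. It is a brute-force search, with an assumed `max_cut_brute_force` helper and two toy graphs, and it is only meant to show what the exact problem asks for and why exact search does not scale.

```python
from itertools import product

def max_cut_brute_force(n_nodes, edges):
    """Try every two-way split of the nodes; return the best cut size and split."""
    best_size, best_sides = -1, None
    # Fix node 0 on side 0 to skip mirror-image splits: 2**(n_nodes - 1) candidates.
    for rest in product((0, 1), repeat=n_nodes - 1):
        sides = (0,) + rest
        # An edge is "cut" when its endpoints land on different sides.
        size = sum(1 for u, v in edges if sides[u] != sides[v])
        if size > best_size:
            best_size, best_sides = size, sides
    return best_size, best_sides

square = [(0, 1), (1, 2), (2, 3), (3, 0)]  # 4-cycle: every edge can be cut
triangle = [(0, 1), (1, 2), (2, 0)]        # odd cycle: one edge always survives

square_size, square_sides = max_cut_brute_force(4, square)
triangle_size, triangle_sides = max_cut_brute_force(3, triangle)
print(square_size, square_sides)
print(triangle_size, triangle_sides)
```

The loop visits 2**(n-1) splits, so each extra node doubles the work. That growth is why approximations are used in practice for anything beyond toy sizes.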

bmordue, about 9 years ago

This is a great resource. As someone who knows little about clustering, this was clear and very informative. It covers potential pitfalls much better than other similar documents I've seen, which is a useful approach.

kasperset, about 9 years ago

How about t-SNE for clustering? <a href="https://lvdmaaten.github.io/tsne/" rel="nofollow">https://lvdmaaten.github.io/tsne/</a>

popra, about 9 years ago

As someone interested in the subject but with 0 insight or knowledge, would these algorithms be a good match for short text clustering? For example identifying identical products in a price comparison app based on their similar but not identical title/name and possibly other attributes.
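
One possible starting point for the product-title case, sketched under assumptions: treat each title as a set of tokens, compare the sets with Jaccard similarity, and link titles whose similarity clears a threshold. The listings, the 0.5 threshold, and the helper names below are all made up for illustration. This is not a claim that the algorithms in the article are the right fit for text.

```python
import re

def tokens(title):
    """Lowercase a title and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def jaccard(a, b):
    """Size of the intersection over size of the union of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def group_titles(titles, threshold=0.5):
    """Single-link grouping: titles end up together if a chain of
    above-threshold similarities connects them (tracked with union-find)."""
    parent = list(range(len(titles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    sets = [tokens(t) for t in titles]
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(titles)):
        groups.setdefault(find(i), []).append(titles[i])
    return list(groups.values())

# Hypothetical listings, invented for illustration.
titles = [
    "Apple iPhone 6s 64GB Space Gray",
    "iPhone 6s (64GB, Space Gray) by Apple",
    "Samsung Galaxy S7 32GB Black",
]
for group in group_titles(titles):
    print(group)
```

The two iPhone listings share six of seven distinct tokens and are grouped; the Samsung listing shares none and stays alone. Real titles would need more care (units, model-number variants, typos), and the all-pairs loop is quadratic, so this is a toy rather than a recipe.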

Wei-1, about 9 years ago

Why not use Density Cluster based on this 2014 Science Paper? <a href="http://science.sciencemag.org/content/344/6191/1492" rel="nofollow">http://science.sciencemag.org/content/344/6191/1492</a>

leecarraher, about 9 years ago

It would be interesting to see what Agglomerative Clustering the author is using here. I suspect for this two-dimensional, density-based cluster dataset, single-link agglomerative would perform much better than what is shown (likely average link).
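
The linkage choice is easy to experiment with. Below is a naive agglomerative sketch with a pluggable linkage function. The toy data (two thin strips) and all helper names are assumptions for illustration; this says nothing about which linkage the article actually used. On this data single link recovers the two strips, since every within-strip gap (0.5) is smaller than the gap between strips (1.0).

```python
import math

def agglomerate(points, n_clusters, linkage):
    """Naive agglomerative clustering: repeatedly merge the closest pair of
    clusters, where 'closest' is defined by the linkage function."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def single_link(a, b):
    """Distance between the two closest members."""
    return min(math.dist(p, q) for p in a for q in b)

def average_link(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

# Hypothetical toy data: two thin strips, 0.5 apart within a strip, 1.0 between.
strip_a = [(0.5 * i, 0.0) for i in range(8)]
strip_b = [(0.5 * i, 1.0) for i in range(8)]
points = strip_a + strip_b

single = agglomerate(points, 2, single_link)
average = agglomerate(points, 2, average_link)
print("single link recovers the strips:",
      sorted(map(sorted, single)) == [strip_a, strip_b])
print("average link cluster sizes:", sorted(len(c) for c in average))
```

Swapping `average_link` in and out on your own data is a quick way to check the suspicion above. Keep in mind that single link is also known for chaining clusters together through a few stray points, so it is not a free win on noisy data.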

amelius, about 9 years ago

Interesting, but the subtitle "Why you should use HDBSCAN" makes little sense on a dataset of N=1.

chestervonwinch, about 9 years ago

Cool. I have not looked into the *DBSCAN methods yet. This post makes me think I should.

dweinus, about 9 years ago

Great article! Does anyone know of an implementation of HDBSCAN for R?

graycat, about 9 years ago

Can find such comparisons by Glenn W. Milligan, Ohio State, back to at least 1980.

Comparing Clustering Algorithms
