Tech Echo

11 comments

minimaxir, over 1 year ago

I built a pipeline to automatically cluster and visualize large numbers of text documents in a completely unsupervised manner:

- Embed all the text documents.
- Project to 2D using UMAP, which also creates its own emergent "clusters".
- Use k-means clustering with a high cluster count depending on dataset size.
- Feed the ChatGPT API ~10 examples from each cluster and ask it to provide a concise label for the cluster.
- Bonus: Use DBSCAN to identify arbitrary subclusters within each cluster.

It is extremely effective, and I have a theoretical implementation of a more practical use case that uses said UMAP dimensionality reduction for better inference. There is evidence that current popular text embedding models (e.g. OpenAI ada, which outputs 1536D embeddings) are way too big for most use cases and could be giving poorly specified results for embedding similarity as a result, in addition to higher costs for the entire pipeline.
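The labeling step in the pipeline above could be prepared roughly as follows. This is only a sketch under assumptions: the function name `build_label_prompts`, its parameters, and the prompt wording are hypothetical, and the cluster ids are assumed to come from whatever clustering step was run earlier. It samples up to ~10 documents per cluster and assembles one prompt per cluster; sending the prompt to a chat model is left out.

```python
import random
from collections import defaultdict


def build_label_prompts(docs, cluster_ids, n_examples=10, seed=0):
    """Return {cluster_id: prompt}, with up to n_examples sampled docs each.

    Hypothetical helper: docs is a list of strings and cluster_ids is a
    parallel list of cluster assignments (e.g. from k-means).
    """
    rng = random.Random(seed)

    # Group documents by their assigned cluster.
    by_cluster = defaultdict(list)
    for doc, cid in zip(docs, cluster_ids):
        by_cluster[cid].append(doc)

    prompts = {}
    for cid, members in sorted(by_cluster.items()):
        # Sample without replacement; small clusters contribute all members.
        sample = rng.sample(members, min(n_examples, len(members)))
        bullets = "\n".join(f"- {text}" for text in sample)
        prompts[cid] = (
            "Here are example documents from one cluster:\n"
            f"{bullets}\n"
            "Provide a concise label for this cluster."
        )
    return prompts
```

Each returned prompt can then be passed to the chat API of your choice; keeping the sampling separate from the API call makes the step easy to test offline.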

derrickrburns, over 1 year ago

AI has sparked new interest in high-dimensional embeddings for approximate nearest neighbor search. Here is a highly scalable implementation of a companion technique, k-means clustering, that uses Spark 1.1 and is written in Scala. Please let me know if you fork this library and update it to later versions of Spark.

zoogeny, over 1 year ago

There is a Twitch streamer, Tsoding, who recently posted a video of himself implementing K-means clustering in C [1]. He also does a follow-up 3D visualization of the algorithm in progress using raylib [2].

1. https://www.youtube.com/watch?v=kH-hqG34ylA&t=4788s&ab_channel=TsodingDaily
2. https://www.youtube.com/watch?v=K7hWqxC_7Mw&ab_channel=TsodingDaily

namuol, over 1 year ago

Here's a very simple toy demonstration of how K-Means works that I made for fun years ago while studying machine learning: https://k-means.stackblitz.io/

Essentially, K-Means is a way of "learning" categories or other kinds of groupings within an unlabeled dataset, without any fancy deep learning. It's handy for its simplicity and speed.

The demo works with simple 2D coordinates for illustrative purposes, but the technique works with any number of dimensions.

Note that there may be some things I got wrong in the implementation, and there are surely other variations of the algorithm, but it still captures the basic idea well enough for an intro.
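For readers who prefer code to a demo, here is a minimal sketch of the standard (Lloyd-style) K-Means iteration in plain Python. It is not the code behind the demo linked above; the function names and the random initialization from data points are assumptions made for illustration. It works for points of any dimension.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iters=100, seed=0):
    """Cluster points (a list of tuples) into k groups.

    Returns (centroids, labels). Initial centroids are k distinct
    data points chosen at random (an illustrative choice).
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Converged once the assignments stop changing.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` separates the three points near the origin from the three points near (10, 10).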

milofeynman, over 1 year ago

I applied to a certain scraping fintech in the Bay Area around 5 years ago and was asked to open the Wikipedia page to k-means squared clustering and implement the algorithm with tests from scratch. I was applying for an Android position. I still laugh thinking about how they paid to fly me out and ask such a stupid interview question.

staticautomatic, over 1 year ago

What are people using k-means for? I can count on one hand the number of times I’ve had a good a priori rationale for the value of k.

Scene_Cast2, over 1 year ago

Although K-means clustering is often the correct approach given time crunch and code complexity constraints, I don't like how it's hard to extend and how it's not principled. By not principled, I mean that it feels more like an algorithm (that happens to optimize) rather than an explicit optimization with an explicit loss function. And I found that in practice, modifying the distance function to anything more interesting doesn't work.

atum47, over 1 year ago

I remember when I first learned k-means, it opened the door for so many projects. Two that are on my GitHub to this day are a Python script that groups your images by similarity (histogram) and one that classifies your expenses based on previous data. I had so much fun working on those.

Scene_Cast2, over 1 year ago

Check out sampling with lightweight coresets if your data is big - it's a principled approach with theoretical guarantees, and it's only a couple of lines of numpy. Do check if the assumptions hold for your data though, as they are stronger than with regular coresets.
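As a rough illustration of the "couple of lines of numpy" mentioned above, here is a sketch of lightweight coreset sampling as it is usually described: each point is drawn with probability that mixes a uniform term with a term proportional to its squared distance from the data mean, and sampled points are reweighted by the inverse of that probability. The function name and parameters are assumptions for this example, and it assumes the points are not all identical (otherwise the distance term divides by zero).

```python
import numpy as np


def lightweight_coreset(X, m, rng=None):
    """Sample m weighted points from X (shape (n, d)) for k-means.

    Hypothetical helper. Returns (points, weights); the weights are
    meant to be passed to a weighted k-means implementation.
    """
    rng = np.random.default_rng(rng)
    # Squared distance of every point to the mean of the data.
    sq_dist = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    # Half uniform, half proportional to squared distance from the mean.
    q = 0.5 / len(X) + 0.5 * sq_dist / sq_dist.sum()
    # Draw m indices i.i.d. (with replacement) according to q.
    idx = rng.choice(len(X), size=m, p=q)
    # Inverse-probability weights keep the weighted cost unbiased.
    return X[idx], 1.0 / (m * q[idx])
```

The uniform half of the mixture bounds every weight by 2n/m, which keeps any single sampled point from dominating the weighted objective.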

anArbitraryOne, over 1 year ago

Fun fact: K-Means is the least interesting clustering algorithm known to humans, but it is quite fast and therefore useful in certain applications.

fiddlerwoaroof, over 1 year ago

Does the “curse of dimensionality” affect the usefulness of k-means?
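One way to probe this question empirically is to look at how Euclidean distances concentrate as the dimension grows, since K-Means assigns points by comparing exactly those distances. The sketch below is an illustration under assumptions (uniform random data in the unit cube, a hypothetical `relative_contrast` helper), not an answer for any particular dataset: it measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from a random query
    to n_points uniform random points in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

On this kind of unstructured data the contrast shrinks as the dimension increases, so "nearest centroid" becomes a weaker signal. Real embeddings often have much lower intrinsic dimension than their nominal size, so the effect there may be milder.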

Generalized K-Means Clustering
