I built a pipeline to automatically cluster and visualize large numbers of text documents in a completely unsupervised manner:<p>- Embed all the text documents.<p>- Project to 2D using UMAP, which also creates its own emergent "clusters".<p>- Use k-means clustering with a high cluster count, scaled to the dataset size.<p>- Feed the ChatGPT API ~10 examples from each cluster and ask it to provide a concise label for the cluster.<p>- Bonus: Use DBSCAN to identify arbitrary subclusters within each cluster.<p>It is <i>extremely</i> effective, and I have a theoretical implementation of a more practical use case: using the same UMAP dimensionality reduction for better inference. There is evidence that current popular text embedding models (e.g. OpenAI ada, which outputs 1536D embeddings) are <i>way</i> too big for most use cases and, as a result, could be giving poorly specified results for embedding similarity, in addition to raising costs across the entire pipeline.
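For anyone who wants to try something similar, a rough sketch of that flow might look like the following. The library choices (sentence-transformers, umap-learn, scikit-learn), the parameter values, and the decision to cluster the 2D projection rather than the raw embeddings are all my own assumptions, not the parent's code:
<pre><code>
import numpy as np
import umap                                    # umap-learn
from sklearn.cluster import KMeans, DBSCAN
from sentence_transformers import SentenceTransformer

docs = [...]  # replace with the real corpus: a list of strings

# 1. Embed all the documents (the parent used OpenAI ada embeddings instead).
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(docs, normalize_embeddings=True)

# 2. Project to 2D with UMAP; the layout itself tends to form visible clusters.
coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(emb)

# 3. Over-cluster with k-means, scaling k with dataset size
#    (clustering the 2D projection here; the raw embeddings would also work).
k = max(10, len(docs) // 200)
labels = KMeans(n_clusters=k, n_init=10).fit_predict(coords)

# 4. Sample ~10 documents per cluster and ask a chat model for a concise label.
rng = np.random.default_rng(0)
for c in range(k):
    members = np.flatnonzero(labels == c)
    sample = [docs[i] for i in rng.choice(members, min(10, len(members)), replace=False)]
    # send `sample` to the ChatGPT API with a "give this cluster a short name" prompt

# 5. Bonus: look for subclusters inside a single cluster with DBSCAN.
sub_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(coords[labels == 0])
</code></pre>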
AI has sparked new interest in high-dimensional embeddings for approximate nearest neighbor search. Here is a highly scalable implementation of a companion technique, k-means clustering, written in Scala for Spark 1.1.<p>Please let me know if you fork this library and update it to later versions of Spark.
There is a Twitch streamer, Tsoding, who recently posted a video of himself implementing K-means clustering in C [1]. He also did a follow-up 3D visualization of the algorithm in progress using raylib [2].<p>1. <a href="https://www.youtube.com/watch?v=kH-hqG34ylA&t=4788s&ab_channel=TsodingDaily" rel="nofollow">https://www.youtube.com/watch?v=kH-hqG34ylA&t=4788s&ab_chann...</a><p>2. <a href="https://www.youtube.com/watch?v=K7hWqxC_7Mw&ab_channel=TsodingDaily" rel="nofollow">https://www.youtube.com/watch?v=K7hWqxC_7Mw&ab_channel=Tsodi...</a>
Here’s a very simple toy demonstration of how K-Means works that I made for fun years ago while studying machine learning: <a href="https://k-means.stackblitz.io/" rel="nofollow">https://k-means.stackblitz.io/</a><p>Essentially, K-Means is a way of “learning” categories or other kinds of groupings within an unlabeled dataset, without any fancy deep learning. It’s handy for its simplicity and speed.<p>The demo works with simple 2D coordinates for illustrative purposes, but the technique works with any number of dimensions.<p>Note that there may be some things I got wrong in the implementation, and there are surely other variations of the algorithm, but it still captures the basic idea well enough for an intro.
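For anyone who prefers reading code to clicking around, here is a minimal numpy version of the same loop (Lloyd's algorithm); this is my own sketch, not the demo's actual code, which runs in the browser:
<pre><code>
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # start from k randomly chosen data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid moves to the mean of its assigned points
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

# toy 2D data, same spirit as the demo; works unchanged for any dimensionality
points = np.random.default_rng(1).random((200, 2))
centroids, labels = kmeans(points, k=3)
</code></pre>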
I applied to a certain scraping fintech in the Bay Area around 5 years ago and was asked to open the Wikipedia page for k-means squared clustering and implement the algorithm, with tests, from scratch. I was applying for an Android position. I still laugh thinking about how they paid to fly me out and ask such a stupid interview question.
Although K-means clustering is often the correct approach given time-crunch and code-complexity constraints, I don't like how hard it is to extend and how unprincipled it is. By not principled, I mean that it feels more like an algorithm (that happens to optimize something) than an explicit optimization with an explicit loss function. And I've found that, in practice, modifying the distance function to anything more interesting doesn't work.
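For what it's worth, that last point has a concrete mechanical reason: Lloyd's iterations only make progress because the mean is the exact minimizer of the summed squared Euclidean distance within a cluster. Swap in another distance and the standard centroid update no longer minimizes the loss the assignment step is using. A small sketch of what the update step would have to become (my own illustration, not from the parent comment):
<pre><code>
import numpy as np

def update_center(points, dist=None):
    """Center update that minimizes the same per-cluster loss as the assignment step."""
    if dist is None:
        # squared Euclidean: the mean is the exact minimizer, which is what
        # makes plain k-means work (for L1, the coordinate-wise median would
        # be the minimizer instead, i.e. k-medians)
        return points.mean(axis=0)
    # for a general distance there is no closed-form center, so the usual fix
    # is to restrict the center to the data points and pick the medoid -- at
    # which point you are running k-medoids, a different and slower algorithm
    pairwise = np.array([[dist(a, b) for b in points] for a in points])
    return points[pairwise.sum(axis=1).argmin()]

# example: an L1-style distance; with this, the plain mean update is no longer optimal
l1 = lambda a, b: np.abs(a - b).sum()
cluster = np.random.default_rng(0).normal(size=(50, 3))
center = update_center(cluster, dist=l1)
</code></pre>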
I remember when I first learned k-means; it opened the door for so many projects. Two that are on my GitHub to this day are a Python script that groups your images by similarity (histogram) and one that classifies your expenses based on previous data. I had so much fun working on those.
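For anyone curious what the histogram version looks like in practice, here is a rough sketch along those lines; the library choices, folder path, and parameters are my guesses, not the original script:
<pre><code>
import glob
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def color_histogram(path, bins=8):
    """Flattened 3D RGB histogram, normalized so every image is comparable."""
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

paths = sorted(glob.glob("photos/*.jpg"))          # hypothetical image folder
X = np.stack([color_histogram(p) for p in paths])
groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

for g in range(5):
    print(g, [p for p, lbl in zip(paths, groups) if lbl == g])
</code></pre>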
Check out sampling with lightweight coresets if your data is big - it's a principled approach with theoretical guarantees, and it's only a couple of lines of numpy. Do check if the assumptions hold for your data though, as they are stronger than with regular coresets.
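For reference, this is roughly what the numpy version looks like as I understand the lightweight coreset construction from Bachem, Lucic &amp; Krause; the function and variable names are mine:
<pre><code>
import numpy as np

def lightweight_coreset(X, m, seed=0):
    """Sample m weighted points whose weighted k-means cost approximates the full data's."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # sampling probability mixes a uniform term with a squared-distance-to-the-mean term
    sq_dist = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    q = 0.5 / n + 0.5 * sq_dist / sq_dist.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])   # inverse-probability weights keep the cost estimate unbiased
    return X[idx], weights

# usage: cluster the coreset instead of the full data, e.g. with scikit-learn
# coreset, w = lightweight_coreset(X, m=1000)
# KMeans(n_clusters=k).fit(coreset, sample_weight=w)
</code></pre>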