TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

Generalized K-Means Clustering

192 points by derrickrburns, over 1 year ago

11 comments

minimaxir, over 1 year ago
I built a pipeline to automatically cluster and visualize large amounts of text documents in a completely unsupervised manner:

- Embed all the text documents.

- Project to 2D using UMAP, which also creates its own emergent "clusters".

- Use k-means clustering with a high cluster count depending on dataset size.

- Feed the ChatGPT API ~10 examples from each cluster and ask it to provide a concise label for the cluster.

- Bonus: use DBSCAN to identify arbitrary subclusters within each cluster.

It is *extremely* effective, and I have a theoretical implementation of a more practical use case that uses said UMAP dimensionality reduction for better inference. There is evidence that current popular text embedding models (e.g. OpenAI ada, which outputs 1536-D embeddings) are *way* too big for most use cases and could be giving poorly specified results for embedding similarity as a result, in addition to higher costs for the entire pipeline.
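A minimal Python sketch of the middle of this pipeline, with random vectors standing in for real document embeddings so it is self-contained; the UMAP projection and the ChatGPT labeling step are only indicated in comments, since they require an extra dependency and an external API. The cluster-count heuristic (`len(embeddings) // 50`) is an illustrative assumption, not the commenter's actual rule:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for real document embeddings (e.g. 1536-D ada vectors);
# random vectors here just to make the sketch runnable.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))

# The UMAP step of the pipeline would be roughly:
#   import umap
#   coords = umap.UMAP(n_components=2).fit_transform(embeddings)
# (umap-learn is an extra dependency, so it is only sketched here.)

# k-means with a cluster count that scales with dataset size
# (hypothetical heuristic for illustration).
n_clusters = max(2, len(embeddings) // 50)
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)

# The labeling step would sample ~10 documents per cluster and send
# them to an LLM with a "give this group a concise label" prompt.
for c in range(n_clusters):
    members = np.flatnonzero(km.labels_ == c)
    sample = members[:10]  # examples to feed into the labeling prompt
```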
derrickrburns, over 1 year ago
AI has sparked new interest in high-dimensional embeddings for approximate nearest neighbor search. Here is a highly scalable implementation of a companion technique, k-means clustering, built on Spark 1.1 and written in Scala.

Please let me know if you fork this library and update it to later versions of Spark.
zoogeny, over 1 year ago
There is a Twitch streamer, Tsoding, who recently posted a video of himself implementing k-means clustering in C [1]. He also does a follow-up 3D visualization of the algorithm in progress using raylib [2].

1. https://www.youtube.com/watch?v=kH-hqG34ylA&t=4788s&ab_channel=TsodingDaily

2. https://www.youtube.com/watch?v=K7hWqxC_7Mw&ab_channel=TsodingDaily
namuol, over 1 year ago
Here’s a very simple toy demonstration of how k-means works that I made for fun years ago while studying machine learning: https://k-means.stackblitz.io/

Essentially, k-means is a way of “learning” categories or other kinds of groupings within an unlabeled dataset, without any fancy deep learning. It’s handy for its simplicity and speed.

The demo works with simple 2D coordinates for illustrative purposes, but the technique works with any number of dimensions.

Note that there may be some things I got wrong with the implementation, and there are surely other variations of the algorithm, but it still captures the basic idea well enough for an intro.
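The "learning groupings from unlabeled 2D points" idea in the demo above can be sketched from scratch in a few lines of numpy; this is plain Lloyd's algorithm (alternate nearest-centroid assignment and mean updates), with made-up blob data for illustration, not the demo's actual code:

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its members
        # (keeping the old centroid if a cluster went empty).
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

# Two well-separated 2D blobs, like the toy demo's point clouds.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0, 0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.3, size=(50, 2))
data = np.vstack([blob_a, blob_b])
centroids, labels = kmeans(data, k=2)
```

The same function works unchanged for higher-dimensional inputs, since the distance and mean computations are dimension-agnostic.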
milofeynman, over 1 year ago
I applied to a certain scraping fintech in the Bay Area around 5 years ago and was asked to open the Wikipedia page for k-means squared clustering and implement the algorithm, with tests, from scratch. I was applying for an Android position. I still laugh thinking about how they paid to fly me out and ask such a stupid interview question.
staticautomatic, over 1 year ago
What are people using k-means for? I can count on one hand the number of times I’ve had a good *a priori* rationale for the value of k.
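When there is no a priori k, one common (if imperfect) workaround is to scan candidate values and score each clustering; a small sketch using scikit-learn's silhouette score, where the synthetic three-blob dataset and the scanned range of k are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three tight synthetic blobs; pretend we don't know k and scan a range.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.2, size=(40, 2))
                  for c in [(0, 0), (4, 0), (2, 4)]])

# Silhouette score rewards clusterings whose points sit close to their
# own cluster and far from the nearest other cluster (range -1 to 1).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

best_k = max(scores, key=scores.get)
```

This only replaces an a priori rationale with a data-driven heuristic, so it works best when the clusters are actually well separated, as they are in this toy dataset.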
Scene_Cast2, over 1 year ago
Although k-means clustering is often the correct approach given time-crunch and code-complexity constraints, I don't like how hard it is to extend and how unprincipled it feels. By "not principled," I mean that it feels more like an algorithm (that happens to optimize) rather than an explicit optimization with an explicit loss function. And I found that in practice, modifying the distance function to anything more interesting doesn't work.
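For what it's worth, Lloyd's algorithm does optimize an explicit loss, the within-cluster sum of squared distances ("inertia"), and each iteration is guaranteed not to increase it; a small numpy sketch on made-up data that records the loss per iteration (the caveat in the comment stands: the guarantee relies on the squared Euclidean distance, which is why swapping in arbitrary distances tends to break it):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(120, 2))
k = 3
centroids = data[rng.choice(len(data), size=k, replace=False)]

losses = []
for _ in range(20):
    # Assignment step under squared Euclidean distance.
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # The explicit loss: sum of squared distances to assigned centroids.
    losses.append(float((dists.min(axis=1) ** 2).sum()))
    # Update step: the mean is exactly the minimizer of this loss per cluster,
    # which is what makes the objective monotonically non-increasing.
    centroids = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                          else centroids[j] for j in range(k)])
```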
atum47, over 1 year ago
I remember when I first learned k-means; it opened the door for so many projects. Two that are on my GitHub to this day are a Python script that groups your images by similarity (histogram) and one that classifies your expenses based on previous data. I had so much fun working on those.
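The histogram-grouping idea could look something like the following sketch, which is a hypothetical reconstruction, not the commenter's script: each "image" (synthetic arrays here, in place of real files) is reduced to a normalized intensity histogram, and k-means clusters those feature vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for image files: ten dark and ten bright
# 8x8 grayscale arrays with pixel values in 0..255.
rng = np.random.default_rng(0)
dark = rng.integers(0, 80, size=(10, 8, 8))
bright = rng.integers(180, 256, size=(10, 8, 8))
images = np.concatenate([dark, bright])

# Feature per image: a normalized 16-bin intensity histogram, so images
# with similar tonal distributions get similar vectors.
feats = np.array([np.histogram(im, bins=16, range=(0, 256))[0] / im.size
                  for im in images])

# Group the images by histogram similarity.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
```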
Scene_Cast2, over 1 year ago
Check out sampling with lightweight coresets if your data is big - it's a principled approach with theoretical guarantees, and it's only a couple of lines of numpy. Do check whether the assumptions hold for your data, though, as they are stronger than with regular coresets.
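Those "couple of lines of numpy" look roughly like this sketch of lightweight-coreset sampling in the style of Bachem et al.: sample points with probability mixing a uniform term and a term proportional to squared distance from the dataset mean, then attach importance weights to correct the bias. Function name and parameters are mine, for illustration:

```python
import numpy as np

def lightweight_coreset(data, m, seed=0):
    """Sample an m-point weighted coreset: half the probability mass is
    uniform, half proportional to squared distance from the global mean."""
    rng = np.random.default_rng(seed)
    n = len(data)
    sq = ((data - data.mean(axis=0)) ** 2).sum(axis=1)
    q = 0.5 / n + 0.5 * sq / sq.sum()          # sampling distribution, sums to 1
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])               # importance weights undo the bias
    return data[idx], weights

rng = np.random.default_rng(2)
data = rng.normal(size=(10_000, 8))
coreset, w = lightweight_coreset(data, m=500)
```

Running weighted k-means on `(coreset, w)` then approximates k-means on the full data, under the distributional assumptions the comment warns about.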
anArbitraryOne, over 1 year ago
Fun fact: K-Means is the least interesting clustering algorithm known to humans, but is quite fast and therefore useful in certain applications
fiddlerwoaroof, over 1 year ago
Does the “curse of dimensionality” affect the usefulness of k-means?
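It does, in the sense that pairwise distances concentrate as dimensionality grows, so "nearest cluster center" becomes a weaker signal; a small numpy demonstration of that concentration effect (the contrast measure and the choice of dimensions are illustrative):

```python
import numpy as np

def distance_contrast(dim, n=500, seed=0):
    """Spread of distances from a random query to random points, relative
    to the smallest distance: (max - min) / min. Shrinks as dim grows."""
    rng = np.random.default_rng(seed)
    points = rng.random((n, dim))   # uniform points in the unit hypercube
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

low = distance_contrast(2)       # in 2D, near and far points differ a lot
high = distance_contrast(1000)   # in 1000D, all distances look similar
```

When the contrast is tiny, squared-distance assignments in k-means are driven by noise, which is one reason pipelines like the embedding one above reduce dimensionality before clustering.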