> it's selfish to keep beautiful discoveries a secret.

I found a beautiful thing recently and planned to do a write-up on it eventually, but I know I might get distracted. So I'll share the beauty here, since I don't want to be selfish!

In k-means clustering you know you've stabilized when centers(t) = centers(t-1); stabilization has occurred because no assignments changed during the Lloyd iteration. People already know this. In many implementations of k-means you'll find this check in the body of the loop, as a special case that breaks out of it. You can't make it the condition of the while loop, because on your first pass you don't yet have a centers(t-1). Actually you can, by supposing a hypothetical all-nil set of centers prior to initialization, but people don't tend to do that. That failure to do so is ugly in the same way that Linus calls special-cased code ugly: it doesn't apply the same procedure to every iteration. Doing so would make the code more beautiful. That's not my discovery, though, just a preference for beauty and consistency.

What I noticed is that the equality check is actually giving you a bitset that tells you which of the centers changed. This is a more general idea than just telling you that you can stop because you're done: it tells you /why/ you aren't done. It's also deeply informative about the problem you're solving, in a way that helps the computation be done more efficiently. I want to show it being deeply informative, so I'll touch on that briefly and then we can revisit the simplicity.

The centers that moved tell you the general locations that have the potential to need reassignment in the future. For example, on a 1-D line over the range 1 to 1,000,000: if a center at 10 moves but the center at 500 stays put, then you know you don't need to consider reassignment for anything past 500. I mean this in two senses. One is that nothing assigned to clusters past 500 can change, so you don't need to look at those points. The other is that centers past 500 can't even become nearer, so you don't have to compute the pairwise distances to them. In the assignment stage of the Lloyd iteration you skip everything above 500, so you not only reduce how many of the N data points you look at, you also reduce how many of the k centers you compare them against. In the 1-to-1,000,000 example, that is probably more than 99% of your data skipped, and the vast majority of centers that you never even need a distance to.

Returning to the simplicity discussion: this means you can write the loop without the special case. Instead of a break when stabilization has occurred, you have a selection-criteria function that tells you what that step of the Lloyd iteration needs to examine. At the initialization stage we went from no center definitions to k definitions, so the selection criteria is well defined even for the very first iteration, on an intuitive level: everything changed.

Why do I find this beautiful? Well, we not only eliminate the special case, which is beautiful on its own, but we can rephrase each iteration in terms of a selection criteria generated by that equality check! We are never special-casing; the reason we stopped was always that the selection criteria was the empty set. We just didn't think of it that way, because we didn't phrase the update step in terms of generating a selection criteria for updates.
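Here's a minimal sketch of what I mean, in Python/NumPy on 1-D data (the function name and the 1-D restriction are just mine, for illustration):

    import numpy as np

    def lloyd_1d(points, centers):
        """Lloyd's algorithm with no special-cased break (1-D sketch)."""
        k = len(centers)
        # The hypothetical all-nil prior: before the first pass every
        # center has "moved" (nil -> initial value), so the selection
        # criteria starts out as all k of them.
        changed = np.ones(k, dtype=bool)
        # The same condition drives every iteration; we stop exactly
        # when the selection criteria is the empty set.
        while changed.any():
            # Assignment step: nearest center for every point.
            labels = np.abs(points[:, None] - centers[None, :]).argmin(axis=1)
            # Update step: each center becomes the mean of its points,
            # keeping its old value if its cluster went empty.
            new_centers = np.array([
                points[labels == j].mean() if (labels == j).any() else centers[j]
                for j in range(k)
            ])
            # The equality check (exact, as in the post) is a per-center
            # bitset, not just a stop test: it says *which* centers moved,
            # i.e. it *is* the selection criteria for the next iteration.
            changed = new_centers != centers
            centers = new_centers
        return centers, labels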
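And the pruning observation as code, assuming the centers are kept sorted so that each cluster is an interval; `affected_points` is a name I made up, and it returns the conservative set of points the next assignment step actually has to look at:

    import numpy as np

    def affected_points(labels, changed):
        """Which points even need to be examined next iteration?

        With sorted 1-D centers, cluster j is the interval between the
        midpoints to its neighbors, and a boundary only moves when one
        of the two centers defining it moves. So a point can only be
        reassigned if its own center or an adjacent center moved.
        """
        # A cell is dirty if its own center or a neighboring one moved.
        dirty = changed.copy()
        dirty[:-1] |= changed[1:]
        dirty[1:] |= changed[:-1]
        # Boolean mask over the points: False means skip the point, and
        # the True ones only need distances to centers j-1, j, and j+1.
        return dirty[labels]

In the example above, if only the center at 10 moved and the next center up is at 500, only those two cells come back dirty: every point above the 500 cell is skipped, and so is the distance to every center above it.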
And when you do phrase it that way, it suddenly becomes obvious how to do certain parallelizations, because your selection criteria tells you where to kick off the next refinement iteration, and /locality/ in the dimensional space determines where the updates get passed. I have this strange feeling that if we just keep pulling on this idea, we'll be able to eliminate the loop that awaits all cluster updates and instead express the computation in a massively parallel way that takes advantage of the topological structure of the problem. I mean, clearly, if you have two centers that moved, one at 5 and another at 900,900, you don't /need/ to wait for 5 to finish its refinement to know that it /isn't/ going to impact the next refinement step at 900,900, because there are so many centers between them. So you should be able to proceed as if the movement at 5 has no impact on the movement at 900,900. Only if they drift closer, and the topology changes, do you have to backtrack; and since we already need to pass these updates through the topological structure, we have a fairly straightforward way of declaring when it's appropriate to backtrack. This phrasing is a terrible fit for the toy problems people solve in classrooms, because of the overhead of tracking the work and the wasted speculative work, but I have a feeling it might be practical at scale. In really massive problems you already have to pay the cost of keeping work around, because things fail and need to be retried; across that many machines the probability that something fails is high enough that failure simply has to be assumed. So the added cost of keeping the work around during the computation isn't as extreme a barrier. It's basically optimistic, massively parallelized clustering, with a resolution protocol for handling two optimistic clustering runs that collide with each other, because the redundancy that scale forces on us anyway effectively makes the choice to be redundant free, rather than expensive wasted work.

Maybe nothing will come of these thoughts, but I found the first one pretty, and it provoked the second line of reasoning, which I found interesting. I'm working on a k-means clustering system that incorporates the good ideas from several k-means research papers, and I plan to explore these ideas in my implementation. But in the spirit of not hiding beautiful things, I hope you enjoy.

Also, as an aside, these aren't completely new ideas. People have noticed for a while that you can use the triangle inequality to speed up the computation, and have shown that it does. This is more an observation about how the looping structure can be seen in a non-special-cased way, how that suggests ways to improve performance, and how it lends itself to alternative control-flow structures.

> it's selfish to keep beautiful discoveries a secret.

It would be really fun to read what others have found beautiful that they've never heard anyone else mention.
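P.S. To make the locality argument above concrete, here's a toy version of the independence test, again over sorted 1-D center indices. The one-step perturbation window is my simplifying assumption, not a worked-out protocol:

    def independent_next_step(i, j):
        """Can refinements at changed centers i and j proceed without
        waiting on each other for the next step?

        One step of movement at center i only perturbs cells i-1..i+1,
        so if the two perturbation windows are disjoint, the updates
        touch disjoint sets of points and can run optimistically in
        parallel. Later drift can merge the windows; that is when you
        backtrack.
        """
        lo, hi = sorted((i, j))
        return hi - lo > 2  # windows {i-1..i+1} and {j-1..j+1} are disjoint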