Talk:K-means clustering
WikiProject Statistics (Unassessed)
WikiProject Robotics (Start-class, Mid-importance)
WikiProject Mathematics (Start-class, Low-priority)
Suggestion: Make the title lowercase (I believe k-means, not K-means, is the consensus). --Chrike 03:04, 19 January 2007 (UTC)
Some examples? --Abdull 16:32, 21 February 2006 (UTC)
== K-means++ ==
Why was the previous comment on the talk page criticizing k-means++ deleted? Anyway, it just seems to be another of the zillion approaches to centroid initialization before the algorithm proper starts. Genetic Algorithms, Particle Swarm, Simulated Annealing... I would question the relevance of the section to the article. —Preceding unsigned comment added by 189.135.63.207 (talk) 17:27, 21 December 2008 (UTC)
I also agree with this. If people want there should be a separate section on seeding methods. The wiki page should not be used as an advertisement like this —Preceding unsigned comment added by 76.168.200.176 (talk) 07:28, 26 January 2009 (UTC)
- I am David Arthur, one of the two authors of that paper. It was not me who wrote the k-means++ section here. I can, however, address the discussion comments, both here and in the past:
- 1. The paper was published at one of the main CS theory conferences, arguably the most prestigious among theory conferences that claim to be seriously interested in practical applications. It is not a seminal work that has been cited tens of thousands of times, but it is not obscure either.
- 2. The paper deals very specifically with k-means, unlike say simulated annealing, and thus makes more sense to reference on this page than simulated annealing. To my knowledge, the only other serious research that has been done on the question of seeding k-means was done at the same time and independently by Ostrovsky et al (see citation and discussion in my paper). The gist of what's going on is they analyze essentially the same algorithm, and show it has a good approximation ratio in certain cases. My paper shows the algorithm has a slightly less good approximation ratio in all cases and also includes some experiments. I do think it would be appropriate to cite that result as well, although it's much more complicated to state. I am unaware of other research results that make sense to reference here.
- 3. The paper proves its theoretical result with math, and does so in all cases, not only in some cases as the deleted comment claims. The experiments on real-life data sets use the standard repository of data sets that almost all academic papers use. I assure you they were not chosen in bad faith, and since sample code for k-means++ is referenced in the paper, you are welcome to judge for yourself.
- I am not a Wikipedia writer and of course I'm biased and I like having my paper referenced here, so I do not presume to make any decisions on notability. However, I did want to clear up these points.71.198.184.231 (talk) 20:15, 11 March 2009 (UTC)
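For readers who haven't seen the paper, here is a minimal sketch of the D²-weighted seeding step it proposes, written in Python/NumPy under my own naming; it is only illustrative (the sample code referenced in the paper is the authoritative version), and ordinary Lloyd iterations would be run on the returned centres afterwards.

```python
import numpy as np

def kmeans_pp_seed(points, k, rng=None):
    """Choose k initial centres by D^2 weighting (k-means++-style seeding)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    # First centre: a data point chosen uniformly at random.
    centres = [points[rng.integers(n)]]
    # Squared distance from every point to its nearest chosen centre so far.
    d2 = np.sum((points - centres[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Next centre: a point x picked with probability proportional to D(x)^2.
        centres.append(points[rng.choice(n, p=d2 / d2.sum())])
        # Refresh nearest-centre distances against the newly added centre.
        d2 = np.minimum(d2, np.sum((points - centres[-1]) ** 2, axis=1))
    return np.array(centres)
```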
== Spherical clusters ==
I assume that the comment in the end of the article ("Also, the algorithm works well only when spherical clusters are naturally available in data.") is very much dependent on the distance metric used... --Andkaha(talk) 15:52, 20 July 2006 (UTC)
== This isn't the k-means algorithm! ==
Did anyone bother to look up the original reference?
The algorithm described by Wikipedia is known as the Lloyd algorithm. It is described in "An Algorithm for Vector Quantizer Design" by Linde, Buzo, and Gray, 1980.
The original k-means algorithm was described by MacQueen ("Some Methods for classification and Analysis of Multivariate Observations", 1967) as follows
Stated informally, the k-means procedure consists of simply starting with k groups each of which consists of a single random point, and thereafter adding each new point to the group whose mean the new point is nearest. After a point is added to a group, the mean of that group is adjusted in order to take account of the new point. Thus at each stage the k-means are, in fact, the means of the groups they represent (hence the term k-means).
Every data point is processed only once, and only one cluster mean is recomputed every time a new data point is added. This is hardly the same algorithm as the one described by Wikipedia.
- This is exactly what the Lloyd clustering does, except that we need to switch to Euclidean space (the algorithm famously works for nutrient grouping etc., but it is difficult to visualize): centroid = mean; k = number of clusters; cluster = group; k-means is for k groups. The informal procedure you quoted is a different interpretation, although it does the exact same thing: in the mentioned method, you already have the points given. You don't have to add them one by one; instead you add the group centers one by one. This is exactly the same. You still associate each point with the nearest site (or center, to distinguish it from a point). Take k sites. Given n points, these k random sites are placed at k of the n possible points (usually you select k points and make them the sites). For every point, figure out which of the k sites is the nearest; that point goes to that group. You will have at most k groups now (some groups may be left unassigned if no point lies near the site). Now recalculate the means, reposition the k sites, and repeat: lather, rinse, repeat. I don't understand how the quoted passage differs from this; it describes this exactly. Please clarify. Also provide a link to the PDF or article where this MacQueen paper can be found; I couldn't find it using Scholar. User_talk:gfxguy
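To make the comparison concrete, here is a rough Python sketch of the two procedures being discussed - the batch (Lloyd-style) iteration the article currently describes, and the one-pass sequential update MacQueen's quote describes. The initialisation and names are only illustrative assumptions, not taken from either source.

```python
import numpy as np

def lloyd_kmeans(points, centres, iters=100):
    """Batch k-means: reassign all points, then recompute all means."""
    for _ in range(iters):
        # Assign every point to its nearest centre (empty clusters not handled).
        d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute every centre as the mean of the points assigned to it.
        centres = np.array([points[labels == j].mean(axis=0)
                            for j in range(len(centres))])
    return centres, labels

def macqueen_kmeans(points, k):
    """Sequential k-means: one pass, updating a single group mean per point."""
    centres = points[:k].astype(float).copy()  # k seed points start the groups
    counts = np.ones(k)
    for x in points[k:]:
        j = ((centres - x) ** 2).sum(axis=1).argmin()  # nearest group mean
        counts[j] += 1
        centres[j] += (x - centres[j]) / counts[j]     # incremental mean update
    return centres
```

The assignment step is the same in both; the difference is purely in when the means are recomputed (after every single point versus after a full pass over the data), which is the distinction the original comment is drawing.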
I'm afraid there's very little correlation between the example and the description, making the example all but useless for someone who's trying to understand what's going on. "The algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic data." I assume that the value of k was chosen as 4. Where then are the 4 initial clusters?
"It then calculates the mean point, or centroid, of each set." Where is this done?
"It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters," Each frame should correspond with a step of the algorithm
Although MacQueen's original k-means algorithm may not be identical to Lloyd's method, the algorithm described in this article is identical to Lloyd's method. Moreover, both articles reflect the common understanding of the k-means algorithm, at least within the algorithms community. The first sentence of the Arthur and Vassilvitskii paper cited in this article is "The k-means method is a well known geometric clustering algorithm based on work by Lloyd in 1982." I've therefore proposed a merge. (Perhaps the merged article should include a clear explanation of MacQueen's algorithm.) — Jeff Erickson ✍ 04:41, 6 April 2007 (UTC)
- I agree with the merge. Lloyd's algorithm is the most common algorithm for approaching the k-means problem. Sancho 08:51, 9 April 2007 (UTC)
- If K-means is more general, I don't see why the articles should be merged. There also seems to be a fair amount of active research on new approaches to K-means (from a survey of Google Scholar articles), so I think it would be a mistake to merge. Imran 05:07, 12 April 2007 (UTC)
- Of course, k-means algorithm is used synonymously with Lloyd's algorithm by many. "The most prominent and widely used clustering algorithm is Lloyd's algorithms sometimes also referred to as the k-means algorithm." (From Frahling, G. and Sohler, C. 2006. A fast k-means implementation using coresets. In Proceedings of the Twenty-Second Annual Symposium on Computational Geometry (Sedona, Arizona, USA, June 05 - 07, 2006). SCG '06. ACM Press, New York, NY, 135-143. DOI= http://doi.acm.org/10.1145/1137856.1137879). Sancho 08:55, 9 April 2007 (UTC)
I liked Sancho's comment. I found an article supporting other methods to solve the K-means problem and referenced it and also clarified the difference between K-means and Lloyd's as well as creating a reference to Lloyd's. Weston.pace 21:56, 12 June 2007 (UTC)
== Image Replacement ==
Originally I started working on replacement images for the examples because they weren't licensed properly and I figured replacement was easier than tracking down the owner. However, they have since been properly licensed, but I went ahead and replaced them anyway because the new images don't show external software (which I believe detracts from the quality of the image) and are in .svg format. However, I could be mistaken and do not wish to unfairly promote my own work, so if someone prefers the old images they can revert. -192.25.240.225 19:30, 26 July 2007 (UTC)
Why does the article put so much emphasis on the K-Means++ algorithm? If this algorithm really were as awesome as the text suggests, then I would expect the "k-means++: The Advantages of Careful Seeding" article to be quoted often by scientific articles. It is not, and I think the reason is quite simple. It is an obscure article that "proves" its results using very specialized data sets that do not match what K-Means is used for out in the wild. I seriously doubt the results will hold if K-means++ is tried in more general situations. —Preceding unsigned comment added by 81.216.185.173 (talk) 10:32, 29 November 2008 (UTC)
== Title ==
I propose changing the title to K-means clustering for two reasons:
- It explains what it does (it clusters around k means)
- There appear to be a number of algorithms, so referring to it as "K-means algorithm" is a misnomer.
comments? -3mta3 (talk) 09:11, 7 April 2009 (UTC)
I keep seeing a lot of references that state the problem is either NP-hard or NP-complete, but have yet to find a proof. The most common reference is:
- Brucker, Peter (1978). "On the complexity of clustering problems". In Optimization and Operations Research (Proc. Workshop, Univ. Bonn, Bonn, 1977), Lecture Notes in Economics and Mathematical Systems 157, pp. 45–54. Berlin: Springer. ISBN 3-540-08842-3. MR0511326. Zbl 0397.68044.
but I have been unable to find a copy. Is anyone else aware of any other sources? -3mta3 (talk) 15:31, 15 April 2009 (UTC)
- http://www.cc.gatech.edu/~vempala/papers/dfkvv.pdf (NP-hard even if k=2), http://www.imsc.res.in/~meena/papers/kmeans.pdf (NP-hard even if d=2). I do not think the second paper has been published. 71.198.184.231 (talk) 22:23, 23 April 2009 (UTC)
- Thanks, I put the first one and the original reference (for anyone who can find it!) in the article. -3mta3 (talk) 08:36, 24 April 2009 (UTC)
== Filtering algorithm ==
I am not sure how to go about adding this, but I think this article should reference Kanungo et al.'s kd-tree filtering algorithm (http://www.cs.umd.edu/~mount/Papers/pami02.pdf). It speeds a naive implementation up by a couple of orders of magnitude. 71.198.184.231 (talk) 22:32, 23 April 2009 (UTC)
- There is a reference to it (see K-means_clustering#Notes) however no description of what it does. Feel free to expand. -3mta3 (talk) 08:15, 24 April 2009 (UTC)
- Ok, I expanded the variations section some -- I'm not super-sure my summary of the last one is good since only the abstract is online. I hadn't heard of it before. 71.198.184.231 (talk) 16:43, 24 April 2009 (UTC)
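For anyone expanding that description further, the heart of Kanungo et al.'s filtering algorithm is a geometric pruning test applied while walking a kd-tree whose nodes store the bounding box (cell) of their points. Below is a minimal Python sketch of just that test; the surrounding traversal is only summarised in comments, and the names are my own rather than taken from the paper's code.

```python
import numpy as np

def can_prune(z, z_star, cell_min, cell_max):
    """True if candidate centre z can be discarded for every point in the cell.

    z_star is the surviving candidate closest to the cell; cell_min/cell_max
    are the coordinate-wise bounds of the node's bounding box.  z is prunable
    when no point of the box can be closer to z than to z_star, and it is
    enough to check the box corner lying farthest in the direction z - z_star.
    """
    u = z - z_star
    corner = np.where(u >= 0, cell_max, cell_min)  # extreme corner towards z
    return np.sum((corner - z) ** 2) >= np.sum((corner - z_star) ** 2)

# In the full filtering algorithm the kd-tree is traversed top-down with a
# shrinking candidate set: at each node every candidate failing the test above
# is dropped, and if only z_star survives, the node's stored point count and
# vector sum are credited to z_star without visiting the individual points.
```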
== A nitpick ==
I think it is somewhat misleading to say that using variance and Euclidean distance are drawbacks of k-means, and it's very misleading to say this is what makes k-means fast. These choices are what make k-means *well defined*. What makes it *fast* is that it does not attempt to find the best clustering even according to its own definitions. Any clustering algorithm has to implicitly define what it means by a "good" clustering, and k-means's choice of Euclidean distance + variance is not obviously better or worse than other choices. So yes, if people use k-means assuming it has a human-like understanding of when two unknown objects are similar, they will be disappointed. But no algorithm does this, and there is certainly no standard belief that an algorithm ever will be able to do this. Meanwhile, the description implies there is a clearly better metric which k-means doesn't use for speed reasons - I do not think this is true, and without real evidence I do not see how the claim belongs on Wikipedia in its current form. 198.161.30.159 (talk) 06:29, 30 June 2009 (UTC)
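For what it's worth, the notion of a "good" clustering that k-means itself commits to is just the within-cluster sum of squared Euclidean distances; Lloyd-style iterations only reach a local minimum of this objective, which is where the actual speed/quality trade-off lies:

```latex
\operatorname*{arg\,min}_{S_1,\dots,S_k} \;
  \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 ,
\qquad
  \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x .
```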
== Is the demonstration complete? ==
As far as I have understood, the last frame is not the result of the algorithm. I think that the result should be a part of the demonstration, too. Bubla (talk) 11:25, 24 August 2009 (UTC)