
Canopy clustering algorithm

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Miniapolis (talk | contribs) at 16:04, 2 April 2011 (Wikification, removed tag). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.


The canopy clustering algorithm, in computing, is an unsupervised clustering algorithm related to the K-means algorithm.

It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set.

The algorithm proceeds as follows:

Cheaply partition the data into overlapping subsets (called "canopies")
Perform more expensive clustering, but only within these canopies
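
The first step can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes the two-threshold scheme described by McCallum, Nigam and Ungar (a loose threshold and a tight threshold), and the names build_canopies, t1, t2 and cheap_distance are hypothetical choices made for this example. Euclidean distance stands in for whatever inexpensive approximate measure suits the data, and each candidate center is measured against every point, which is one reading of the procedure.

```python
import random
from math import dist  # Euclidean distance, available in Python 3.8+


def build_canopies(points, t1, t2, cheap_distance=dist, seed=0):
    """Group points into possibly overlapping canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (removal from the candidate-center list); 0 <= t2 <= t1.
    All names here are illustrative assumptions, not a library API.
    Returns a list of (center_index, member_indices) tuples.
    """
    if not 0 <= t2 <= t1:
        raise ValueError("thresholds must satisfy 0 <= t2 <= t1")

    rng = random.Random(seed)
    remaining = list(range(len(points)))  # indices still eligible as centers
    canopies = []

    while remaining:
        # Pick any remaining point as the next canopy center.
        center = remaining[rng.randrange(len(remaining))]
        distances = [cheap_distance(points[center], p) for p in points]

        # Every point within the loose threshold joins this canopy,
        # so a point may belong to several canopies.
        members = [i for i, d in enumerate(distances) if d <= t1]
        canopies.append((center, members))

        # Points within the tight threshold can no longer become centers.
        # The center itself is at distance 0, so the list always shrinks.
        remaining = [i for i in remaining if distances[i] > t2]

    return canopies


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    for center, members in build_canopies(sample, t1=3.0, t2=1.5):
        print("center", center, "members", members)
```

The second step would then run the more expensive clustering algorithm, such as K-means, while computing exact distances only between points that share at least one canopy.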

Benefits

The number of instances of training data that must be compared at each step is reduced, because the expensive comparison is made only between instances that share a canopy
There is some evidence that the resulting clusters are improved^[1]
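
The reduction in comparisons can be illustrated with a small count. The sketch below assumes the canopies are already available as lists of point indices; the canopy contents and the helper name candidate_pairs are hypothetical, chosen only to make the arithmetic concrete.

```python
from itertools import combinations


def candidate_pairs(canopies):
    """Return the set of index pairs that share at least one canopy."""
    pairs = set()
    for members in canopies:
        # Sorting gives each pair a canonical (smaller, larger) form,
        # so a pair found in two canopies is counted only once.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical canopies over six points; point 2 lies in two canopies.
example_canopies = [[0, 1, 2], [2, 3, 4], [5]]
n_points = 6

all_pairs = n_points * (n_points - 1) // 2             # every pair: 15
restricted = len(candidate_pairs(example_canopies))    # pairs sharing a canopy: 6

print(all_pairs, restricted)
```

In this toy case the expensive distance would be computed for 6 pairs instead of 15. The saving on real data depends on how many canopies are formed and how much they overlap.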

References

^ Mahout description of Canopy-Clustering. Retrieved 2011-04-02.

External link

McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching"


Retrieved from "https://en.wikipedia.org/w/index.php?title=Canopy_clustering_algorithm&oldid=421994721"
