Nearest centroid classifier

In machine learning, a nearest centroid or nearest prototype classifier is a classification model that assigns to each observation the label of the class whose training samples have the mean (centroid) closest to that observation.

When applied to text classification using tf*idf vectors to represent documents, the nearest centroid classifier is known as the Rocchio classifier because of its similarity to the Rocchio algorithm for relevance feedback.^[1]

An extended version of the nearest centroid classifier, based on shrunken centroids, has been applied in the medical domain, specifically to the classification of tumors from gene expression data.^[2]

Algorithm

Training procedure: given labeled training samples $\{(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)\}$ with class labels $y_i \in \mathbf{Y}$, compute the per-class centroids $\vec{\mu}_l = \frac{1}{|C_l|} \sum_{i \in C_l} \vec{x}_i$, where $C_l$ is the set of indices of samples belonging to class $l \in \mathbf{Y}$.
Prediction function: the class assigned to an observation $\vec{x}$ is $\hat{y} = \arg\min_{l \in \mathbf{Y}} \|\vec{\mu}_l - \vec{x}\|$.
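
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `fit_centroids` and `predict` and the toy data are hypothetical, and the norm is assumed to be Euclidean, which the formula above does not specify.

```python
import math
from collections import defaultdict


def fit_centroids(samples, labels):
    """Training: return a dict mapping each class label to the mean of its samples."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        # Accumulate the coordinate-wise sum for class y.
        sums[y] = [s + xi for s, xi in zip(sums[y], x)]
        counts[y] += 1
    # Divide each class sum by the class size |C_l| to get the centroid.
    return {y: [s / counts[y] for s in total] for y, total in sums.items()}


def predict(centroids, x):
    """Prediction: return the label whose centroid is closest to x.

    Assumes Euclidean distance; ties go to the class seen first in training.
    """
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


# Toy data (made up for illustration): two classes in the plane.
samples = [(1.0, 1.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
labels = ["a", "a", "b", "b"]

centroids = fit_centroids(samples, labels)
print(centroids)                        # {'a': [1.5, 1.0], 'b': [8.5, 8.5]}
print(predict(centroids, (2.0, 2.0)))   # a
print(predict(centroids, (7.0, 7.0)))   # b
```

Training takes a single pass over the data, and prediction compares the observation against one centroid per class, so the cost of classifying a point grows with the number of classes rather than with the number of training samples.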

References

[1] Manning, Christopher; Raghavan, Prabhakar; Schütze, Hinrich (2008). "Vector space classification". Introduction to Information Retrieval. Cambridge University Press.
[2] Tibshirani, Robert; Hastie, Trevor; Narasimhan, Balasubramanian; Chu, Gilbert (2002). "Diagnosis of multiple cancer types by shrunken centroids of gene expression". Proceedings of the National Academy of Sciences. 99 (10).
