Cluster analysis
Data clustering is a common technique for data analysis that is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering consists of partitioning a data set into subsets (clusters) so that the data in each subset ideally share some common trait, often similarity or proximity according to some defined distance measure.
Data clustering algorithms can be hierarchical or partitional, and hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down).
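As a rough illustration of the two families, the sketch below clusters a handful of one-dimensional points twice: once agglomeratively (merging the two closest clusters until k remain) and once partitionally (a basic k-means loop). The function names, the sample data, and the choice of single linkage and k-means are assumptions made for this example; they are only one possible instance of each family.

```python
from itertools import combinations


def agglomerative(points, k):
    """Bottom-up: start with singletons, merge the closest pair until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Single linkage (an assumption): cluster distance is the
        # smallest distance between any two of their points.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters[j])  # i < j, so deleting j is safe
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


def kmeans(points, centers, iterations=10):
    """Partitional: assign each point to its nearest center, then recompute centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else centers[n] for n, g in enumerate(groups)]
    return sorted(sorted(g) for g in groups)


# Hypothetical sample data with two visibly separate groups.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
hier = agglomerative(data, k=2)
part = kmeans(data, centers=[0.0, 6.0])
```

On data this well separated both approaches find the same two groups. In general they can disagree, since the agglomerative result depends on the linkage rule and the k-means result on the initial centers.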
Applications
In biology, clustering has two main applications, in the fields of computational biology and bioinformatics.
- In proteomics, clustering is used to build groups of proteins with related expression patterns. Such groups often contain functionally related proteins, so clustering the results of high-throughput experiments, such as those using expressed sequence tags (ESTs), can be a powerful tool for genome annotation, a general aspect of genomics.
- In sequence analysis, clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics and in evolutionary biology in general. See evolution by gene duplication.
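The sequence-grouping idea in the list above can be sketched as a toy: two sequences are linked whenever their edit distance is at or below a threshold, and linked sequences end up in one family. Edit distance as the similarity measure, the threshold of 2, the sample strings, and the function names are all assumptions chosen for illustration; real homology searches typically rely on alignment scores rather than raw edit distance.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def group_sequences(seqs, threshold):
    """Single-linkage grouping: a sequence joins a family if it is within
    `threshold` edits of any member; families it links together are merged."""
    families = []
    for s in seqs:
        linked = [f for f in families
                  if any(edit_distance(s, m) <= threshold for m in f)]
        merged = [s]
        for f in linked:
            merged.extend(f)
            families.remove(f)
        families.append(merged)
    return sorted(sorted(f) for f in families)


# Hypothetical toy sequences, not real gene data.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TTTTGGGG", "TTTTGGGC"]
families = group_sequences(seqs, threshold=2)
```

Here the three ACG-like strings form one family and the two TTTT-like strings another. Because the linking is transitive, two members of a family may be farther apart than the threshold as long as a chain of close neighbours connects them.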