Correlation clustering
Correlation Clustering
Correlation Clustering operates in a scenario where the relationship between the objects is known instead of the actual representation of the objects. For example, given a fully connected graph where the edge label indicates whether two nodes are similar (+) or different (-), the task is to cluster the vertices such that similar objects are grouped together. Unlike other clustering algorithms this doesn't require the number of clusters to be specified because the objective is to maximize the disagreements and is independent of the number of clusters.
Notice that, it may not be possible to find a perfect clustering i.e. all the similar items are in a cluster while all dissimilar vertices are in different clusters. If the graph indeed admits a perfect clustering, then simply deleting all the negative edges and finding the connected components in the remaining graph will return the required clusters.
But, in general a graph may not have a perfect clustering. For example, given nodes {a,b,c} such that (a,b) and (a,c) are similar while (b,c) are dissimilar a perfect clustering is not possible. In such cases, the task is to find a clustering that maximizes the number of agreements (number of + edges inside clusters plus the number of - edges between clusters) or minimizes the number of disagreements (the number of - edges inside clusters plus the number of + edges between clusters). This problem of maximizing the agreements is NP-complete ( Multiway cut problem reduces to maximizing weighted agreements and the problem of partitioning into triangles[1] can be reduced to unweighted version)
Bansal et. al.[2] discusses the NP-completeness proof and also presents both a constant factor approximation algorithm and polynomial-time approximation scheme to find the clusters in this setting. Nir et. al.[3] proposes a randomized 3-approximation algorithm for the same problem.
CC-Pivot(G=(V,E+,E-))
Pick random pivot i ∈ V
Set , V'=Ø
For all j ∈ V, j ≠ i;
If (i,j) ∈ E+ then
Add j to C
Else (If (i,j) ∈ E-)
Add j to V'
Let G' be the subgraph induced by V'
Return clustering C,CC-Pivot(G')
The authors show that the above algorithm is a 3-approximation algorithm for correlation clustering.
References
- ^ Garey, M. and Johnson, D. (2000). Computers and Intractability: A Guide to the Theory of NP-Completeness.
{{cite conference}}
: Unknown parameter|organization=
ignored (help)CS1 maint: multiple names: authors list (link) - ^ Bansal, N., Blum, A. and Chawla, S. (2004). "Correlation Clustering". Machine Learning Journal (Special Issue on Theoritical Advances in Data Clustering,. pp. 86–113, .
{{cite conference}}
: Unknown parameter|booktitle=
ignored (|book-title=
suggested) (help)CS1 maint: extra punctuation (link) CS1 maint: multiple names: authors list (link) - ^ Ailon, Nir and Charikar, Moses and Newman, Alantha (2005). "Aggregating inconsistent information: ranking and clustering". STOC '05: Proceedings of the thirty-seventh annual ACM symposium on Theory of computing. pp. 684–693, .
{{cite conference}}
: Unknown parameter|booktitle=
ignored (|book-title=
suggested) (help)CS1 maint: extra punctuation (link) CS1 maint: multiple names: authors list (link)