Correlation clustering
Clustering is the problem of partitioning data points into groups based on similarity or dissimilarity. Correlation clustering is a clustering framework in which a set of objects is partitioned into clusters based on pairwise similarity and dissimilarity information, without requiring the number of clusters to be specified in advance.[1]
Description of the problem
In machine learning, correlation clustering (also known as cluster editing) considers settings in which pairwise similarity or dissimilarity relationships between objects are known. A standard formulation models the input as an unweighted complete graph $G = (V, E)$, where each edge $\{u, v\}$ is labeled either $+$ or $-$ (that is, the graph is a signed graph), indicating whether the corresponding endpoints are similar or dissimilar.
The goal is to find a clustering (that is, a partition of $V$) that either maximizes the number of agreements (the sum of positive edges whose endpoints lie in the same cluster and negative edges whose endpoints lie in different clusters) or minimizes the number of disagreements (the sum of positive edges whose endpoints are separated and negative edges whose endpoints lie in the same cluster). Unlike other clustering methods such as k-means, correlation clustering does not require choosing the number of clusters in advance.
It is not always possible to find a clustering with zero disagreements. For example, consider a triangle graph containing two positive edges and one negative edge. In this case, every clustering incurs at least one disagreement. Such configurations are referred to in the literature as bad triangles.[2]
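The bad-triangle example can be checked exhaustively: enumerating every partition of the three vertices shows that no clustering achieves zero disagreements (a minimal sketch; the vertex names and the helper function are illustrative).

```python
from itertools import product

# Bad triangle: positive edges (a,b) and (b,c), negative edge (a,c).
positive = {("a", "b"), ("b", "c")}
negative = {("a", "c")}
nodes = ["a", "b", "c"]

def disagreements(labels):
    """Count edges that disagree with the clustering given by `labels`."""
    bad = 0
    for u, v in positive:
        if labels[u] != labels[v]:  # positive edge cut between clusters
            bad += 1
    for u, v in negative:
        if labels[u] == labels[v]:  # negative edge inside a cluster
            bad += 1
    return bad

# Enumerate all clusterings (cluster ids 0..2 suffice for 3 nodes).
best = min(disagreements(dict(zip(nodes, ids)))
           for ids in product(range(3), repeat=3))
print(best)  # 1: every clustering incurs at least one disagreement
```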
From a computational perspective, optimizing the correlation clustering objective is challenging. The (decision version of the) problem is NP-complete.[3] A large body of subsequent work has developed approximation algorithms for correlation clustering under various assumptions, including complete or general graphs and unweighted or weighted graphs, for both minimization and maximization objectives. This problem is considered one of the fundamental combinatorial optimization problems, and many algorithmic techniques have been developed to address it.
The problem has also been studied extensively across multiple disciplines. A comprehensive literature review of early correlation clustering research is provided by Wahid and Hassini.[4]
Formal Definitions
Let $G = (V, E)$ be a graph with nodes $V$ and edges $E$. A clustering of $G$ is a partition $\Pi = \{V_1, \dots, V_k\}$ of its node set with $\bigcup_{i=1}^{k} V_i = V$ and $V_i \cap V_j = \emptyset$ for $i \neq j$. For a given clustering $\Pi$, let $E_\Pi \subseteq E$ denote the subset of edges of $G$ whose endpoints are in different subsets of the clustering $\Pi$. Now, let $w \colon E \to \mathbb{R}_{\geq 0}$ be a function that assigns a non-negative weight $w_e$ to each edge of the graph and let $E = E^+ \cup E^-$ be a partition of the edges into attractive ($E^+$) and repulsive ($E^-$) edges; that is, the edges are signed.
The minimum disagreement correlation clustering problem is the following optimization problem:

$$\min_{\Pi} \; \sum_{e \in E^+ \cap E_\Pi} w_e \;+\; \sum_{e \in E^- \setminus E_\Pi} w_e$$

Here, the set $E^+ \cap E_\Pi$ contains the attractive edges whose endpoints are in different components with respect to the clustering $\Pi$, and the set $E^- \setminus E_\Pi$ contains the repulsive edges whose endpoints are in the same component with respect to the clustering $\Pi$. Together these two sets contain all edges that disagree with the clustering $\Pi$.
Similarly to the minimum disagreement correlation clustering problem, the maximum agreement correlation clustering problem is defined as

$$\max_{\Pi} \; \sum_{e \in E^+ \setminus E_\Pi} w_e \;+\; \sum_{e \in E^- \cap E_\Pi} w_e$$

Here, the set $E^+ \setminus E_\Pi$ contains the attractive edges whose endpoints are in the same component with respect to the clustering $\Pi$, and the set $E^- \cap E_\Pi$ contains the repulsive edges whose endpoints are in different components with respect to the clustering $\Pi$. Together these two sets contain all edges that agree with the clustering $\Pi$.
Instead of formulating the correlation clustering problem in terms of non-negative edge weights and a partition of the edges into attractive and repulsive edges, the problem can also be formulated in terms of positive and negative edge costs without partitioning the set of edges explicitly. For given weights $w$ and a given partition of the edges into attractive ($E^+$) and repulsive ($E^-$) edges, the edge costs can be defined by

$$c_e = \begin{cases} w_e & \text{if } e \in E^+ \\ -w_e & \text{if } e \in E^- \end{cases} \qquad \text{for all } e \in E.$$
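This cost convention can be sketched directly (an illustrative snippet; the dictionary representation of the signed, weighted graph is an assumption): attractive edges keep their weight as a positive cost, repulsive edges get a negated cost.

```python
def edge_costs(weights, attractive):
    """Map non-negative weights plus signs to signed edge costs:
    c_e = w_e for attractive edges, c_e = -w_e for repulsive ones."""
    return {e: (w if e in attractive else -w) for e, w in weights.items()}

weights = {("a", "b"): 2.0, ("b", "c"): 1.0, ("a", "c"): 3.0}
attractive = {("a", "b"), ("b", "c")}        # ("a", "c") is repulsive
costs = edge_costs(weights, attractive)
print(costs)  # {('a', 'b'): 2.0, ('b', 'c'): 1.0, ('a', 'c'): -3.0}
```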
An edge whose endpoints are in different clusters is said to be cut. The set of all edges that are cut is often called a multicut[5] of $G$.
The minimum cost multicut problem is the problem of finding a clustering $\Pi$ of $G$ such that the sum of the costs of the edges whose endpoints are in different clusters is minimal:

$$\min_{\Pi} \; \sum_{e \in E_\Pi} c_e$$
Similar to the minimum cost multicut problem, coalition structure generation in weighted graph games[6] is the problem of finding a clustering $\Pi$ such that the sum of the costs of the edges that are not cut is maximal:

$$\max_{\Pi} \; \sum_{e \in E \setminus E_\Pi} c_e$$

This formulation is also known as the clique partitioning problem.[7]
It can be shown that all four problems formulated above are equivalent: a clustering that is optimal with respect to any one of the four objectives is optimal with respect to all four.
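The equivalence can be verified by brute force on a small instance: enumerating all clusterings of a four-node signed graph, the optimizers of all four objectives induce the same partition (an illustrative sketch; the edge weights and signs are assumed for the example).

```python
from itertools import product

nodes = ["a", "b", "c", "d"]
weights = {("a", "b"): 2.0, ("a", "c"): 1.0, ("b", "c"): 1.0,
           ("a", "d"): 1.0, ("c", "d"): 2.0}
attractive = {("a", "b"), ("c", "d")}        # the remaining edges are repulsive
costs = {e: (w if e in attractive else -w) for e, w in weights.items()}

def objectives(labels):
    """Return (disagreement, agreement, multicut cost, coalition value)."""
    cut = {e for e in weights if labels[e[0]] != labels[e[1]]}
    uncut = set(weights) - cut
    disagree = (sum(weights[e] for e in attractive & cut)
                + sum(weights[e] for e in uncut - attractive))
    agree = sum(weights.values()) - disagree
    multicut = sum(costs[e] for e in cut)
    coalition = sum(costs[e] for e in uncut)
    return disagree, agree, multicut, coalition

clusterings = [dict(zip(nodes, ids)) for ids in product(range(4), repeat=4)]
best_dis = min(clusterings, key=lambda l: objectives(l)[0])
best_agr = max(clusterings, key=lambda l: objectives(l)[1])
best_cut = min(clusterings, key=lambda l: objectives(l)[2])
best_coa = max(clusterings, key=lambda l: objectives(l)[3])

def same_partition(l, m):
    return all((l[u] == l[v]) == (m[u] == m[v]) for u in nodes for v in nodes)

print(same_partition(best_dis, best_agr),
      same_partition(best_dis, best_cut),
      same_partition(best_dis, best_coa))  # True True True
```

The four objectives differ only by constants (the total weight of all edges, or of the repulsive edges), so their optimal clusterings coincide.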
Algorithms
If the graph admits a clustering with zero disagreements, then deleting all negative edges and computing the connected components of the remaining graph yields an optimal clustering. A necessary and sufficient condition for the existence of such a clustering was given by Davis: no cycle in the graph may contain exactly one negative edge.[8]
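When a zero-disagreement clustering exists, the two-step procedure above can be implemented directly (a minimal sketch; the example graph and helper names are illustrative).

```python
from collections import defaultdict

# A signed graph that admits a zero-disagreement clustering: two positive
# cliques joined only by negative edges (an illustrative instance).
positive = {("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")}
negative = {("a", "d"), ("c", "e")}
nodes = sorted({x for e in positive | negative for x in e})

# Step 1: delete all negative edges (keep only the positive subgraph).
adj = defaultdict(set)
for u, v in positive:
    adj[u].add(v)
    adj[v].add(u)

# Step 2: the connected components of the remaining graph are the clusters.
def components(nodes):
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            seen.add(u)
            stack.extend(adj[u] - comp)
        comps.append(comp)
    return comps

print(sorted(sorted(c) for c in components(nodes)))  # [['a', 'b', 'c'], ['d', 'e']]
```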
Bansal et al.[9] discuss the NP-completeness proof and also present both a constant factor approximation algorithm and a polynomial-time approximation scheme for finding the clusters in this setting. Ailon et al.[10] propose a randomized 3-approximation algorithm for the same problem.
CC-Pivot(G = (V, E+, E−))
    Pick a random pivot i ∈ V
    Set C = {i}, V′ = Ø
    For all j ∈ V, j ≠ i:
        If (i,j) ∈ E+ then
            Add j to C
        Else (if (i,j) ∈ E−)
            Add j to V′
    Let G′ be the subgraph induced by V′
    Return clustering C, CC-Pivot(G′)
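The recursion above can be sketched in Python as follows (an illustrative translation, with the tail recursion unrolled into a loop; the representation of positive edges as frozensets is an assumption).

```python
import random

def cc_pivot(nodes, positive, rng=random.Random(0)):
    """CC-Pivot on a complete signed graph: `positive` holds the + edges
    as frozensets; every other pair is implicitly a - edge."""
    nodes = list(nodes)
    clusters = []
    while nodes:                                    # recursion unrolled
        i = rng.choice(nodes)                       # pick a random pivot
        cluster = [i] + [j for j in nodes
                         if j != i and frozenset((i, j)) in positive]
        clusters.append(cluster)
        placed = set(cluster)
        nodes = [j for j in nodes if j not in placed]   # V' for the next round
    return clusters

positive = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]}
print(sorted(sorted(c) for c in cc_pivot(["a", "b", "c", "d", "e"], positive)))
# [['a', 'b', 'c'], ['d', 'e']] on this instance, regardless of pivot choice
```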
The authors show that the above algorithm is a 3-approximation algorithm for correlation clustering. The best currently known polynomial-time approximation algorithm for this problem achieves an approximation ratio of about 2.06 by rounding a linear program, as shown by Chawla, Makarychev, Schramm, and Yaroslavtsev.[11]
Karpinski and Schudy[12] proved the existence of a polynomial-time approximation scheme (PTAS) for this problem on complete graphs with a fixed number of clusters.
Optimal number of clusters
In 2011, Bagon and Galun[13] showed that the optimization of the correlation clustering functional is closely related to well-known discrete optimization methods. They proposed a probabilistic analysis of the underlying implicit model that allows the correlation clustering functional to estimate the underlying number of clusters. This analysis suggests that the functional assumes a uniform prior over all possible partitions regardless of their number of clusters; as a result, a non-uniform prior over the number of clusters emerges.
Several discrete optimization algorithms that scale gracefully with the number of elements are proposed in this work (experiments report results with more than 100,000 variables). Bagon and Galun also evaluated how effectively these methods recover the underlying number of clusters in several applications.
Correlation clustering (data mining)
Correlation clustering also relates to a different task, in which correlations among attributes of feature vectors in a high-dimensional space are assumed to exist and to guide the clustering process. These correlations may differ in different clusters, so a global decorrelation cannot reduce this to traditional (uncorrelated) clustering.
Correlations among subsets of attributes result in different spatial shapes of clusters. Hence, the similarity between cluster objects is defined by taking into account the local correlation patterns. With this notion, the term was introduced in [14] simultaneously with the notion discussed above. Different methods for this type of correlation clustering are discussed in [15], and its relationship to different types of clustering is discussed in [16]. See also Clustering high-dimensional data.
Correlation clustering (according to this definition) can be shown to be closely related to biclustering. As in biclustering, the goal is to identify groups of objects that share a correlation in some of their attributes, where the correlation is typically specific to the individual clusters.
References
- ^ Becker, Hila, "A Survey of Correlation Clustering", 5 May 2005.
- ^ Ailon, Nir; Charikar, Moses; Newman, Alantha (2008). "Aggregating inconsistent information: ranking and clustering". Journal of the ACM. 55 (5). ACM: 1–27.
- ^ Bansal, Nikhil; Blum, Avrim; Chawla, Shuchi (2004). "Correlation clustering". Machine Learning. 56 (1–3). Springer: 89–113. doi:10.1023/B:MACH.0000033116.57574.95.
- ^ Wahid, Dewan F.; Hassini, Elkafi (2022). "A literature review on correlation clustering: cross-disciplinary taxonomy with bibliometric analysis". Operations Research Forum. 3 (3) 47. Springer.
- ^ Deza, M.; Grötschel, M.; Laurent, M. (1992). "Clique-Web Facets for Multicut Polytopes". Mathematics of Operations Research. 17 (4): 981–1000. doi:10.1287/moor.17.4.981.
- ^ Bachrach, Yoram; Kohli, Pushmeet; Kolmogorov, Vladimir; Zadimoghaddam, Morteza (2013). "Optimal coalition structure generation in cooperative graph games". Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 27. pp. 81–87.
- ^ Grötschel, M.; Wakabayashi, Y. (1989). "A cutting plane algorithm for a clustering problem". Mathematical Programming. 45 (1–3): 59–96. doi:10.1007/BF01589097.
- ^ Davis, James A. (1963). "Structural balance, mechanical solidarity, and interpersonal relations". American Journal of Sociology. 68: 444–463.
- ^ Bansal, N.; Blum, A.; Chawla, S. (2004). "Correlation Clustering". Machine Learning. 56 (1–3): 89–113. doi:10.1023/B:MACH.0000033116.57574.95.
- ^ Ailon, N.; Charikar, M.; Newman, A. (2005). "Aggregating inconsistent information". Proceedings of the thirty-seventh annual ACM symposium on Theory of computing – STOC '05. p. 684. doi:10.1145/1060590.1060692. ISBN 1581139608.
- ^ Chawla, Shuchi; Makarychev, Konstantin; Schramm, Tselil; Yaroslavtsev, Grigory. "Near Optimal LP Rounding Algorithm for Correlation Clustering on Complete and Complete k-partite Graphs". Proceedings of the 46th Annual ACM Symposium on Theory of Computing.
- ^ Karpinski, M.; Schudy, W. (2009). "Linear time approximation schemes for the Gale-Berlekamp game and related minimization problems". Proceedings of the 41st annual ACM symposium on Symposium on theory of computing – STOC '09. p. 313. arXiv:0811.3244. doi:10.1145/1536414.1536458. ISBN 9781605585062.
- ^ Bagon, S.; Galun, M. (2011). "Large Scale Correlation Clustering Optimization". arXiv:1112.2903v1.
- ^ Böhm, C.; Kailing, K.; Kröger, P.; Zimek, A. (2004). "Computing Clusters of Correlation Connected objects". Proceedings of the 2004 ACM SIGMOD international conference on Management of data – SIGMOD '04. p. 455. CiteSeerX 10.1.1.5.1279. doi:10.1145/1007568.1007620. ISBN 978-1581138597. S2CID 6411037.
- ^ Zimek, A. (2008). Correlation Clustering (Text.PhDThesis). Ludwig-Maximilians-Universität München.
- ^ Kriegel, H. P.; Kröger, P.; Zimek, A. (2009). "Clustering high-dimensional data". ACM Transactions on Knowledge Discovery from Data. 3: 1–58. doi:10.1145/1497577.1497578. S2CID 17363900.