Hoshen–Kopelman algorithm
This article needs additional citations for verification. (September 2016) |
Part of a series on |
Machine learning and data mining |
---|
The Hoshen–Kopelman algorithm is simple and efficient algorithm for labeling clusters on a grid. Where the grid is a regular network of cells,with the cells being either occupied or unoccupied. This algorithm is based on well-known union-finding algorithm[1]. The algorithm was originally described in by J. Hoshen and R. Kopelman in their 1976 paper <a href="http://journals.aps.org/prb/abstract/10.1103/PhysRevB.14.3438" target="_blank">Percolation and Cluster Distribution. I. Cluster Multiple Labeling Technique and Critical Concentration Algorithm</a>.[2]
Percolation theory
Percolation theory is the study of the behavior and statistics of clusters on lattices. Suppose we have a large square lattice where each cell can be occupied with the probability p and empty with probability 1 – p. Each group of neighboring occupied cells forms a cluster. Neighbors are defined as cells having a common side but not those sharing only a corner i.e. we consider 4x4 neighborhood. (top, bottom, left, right). Each occupied cell is occupied independently of the status of its neighborhood. The number of clusters, a size of each cluster and their distribution are important topics in percolation theory.
|
|
Consider 5x5 grids in figure (a) and (b).In figure (a), the probability of occupancy is p = 6/25 = 0.24 .In figure (b), the probability of occupancy is p = 16/25 = 0.64 .
|
Hoshen–Kopelman algorithm for cluster finding
In this algorithm we scan through a grid looking for occupied cells and labeling them with cluster labels. The algorithm begins with scanning the grid cell by cell and check if the cell is occupied, if the cell is occupied then this cell must be labelled with a cluster label. This cluster label is decided based on the neighbors of the cell which have been previously scanned and labelled, and if the cell doesn’t have any occupied neighbors then new label is assigned to the cell.
Union-find algorithm
The union-find algorithm is a simple method for computing equivalence classes. Calling the function union(x,y)
specifies that items x
and y
are members of the same equivalence class. Because equivalence relations are transitive; all items equivalent to x
are equivalent to all items equivalent to y
. Thus for any item x
, there is a set of items which are all equivalent to x
are; this set is the equivalence class of which x
is a member. A second function find(x)
returns a representative member of the equivalence class to which x
belongs.
Pseudo-code
So in the raster scan of the grid in question, each time an occupied cell is encountered, a check is done to see whether this cell has any neighboring cells who have already been scanned. If so, first a union
operation is performed, to specify that these neighboring cells are in fact members of the same equivalence class. Then a find
operation is performed to find a representative member of that equivalence class with which to label the current cell. If on the other hand, the current cell has no neighbors, it is assigned a new, previously unused, label. The entire grid is processed in this way. The grid can be raster-scanned a second time, performing only find
operations at each cell, to re-label the cells with their final assignment of a representative element.
Following pseudo-code is referred from one of the University of California Berkeley's projects.[3]
Raster Scan and Labeling on the Grid largest_label = 0; for x in 0 to n_columns { for y in 0 to n_rows { if occupied[x,y] then left = occupied[x-1,y]; above = occupied[x,y-1]; if (left == 0) and (above == 0) then /* Neither a label above nor to the left. */ largest_label = largest_label + 1; /* Make a new, as-yet-unused cluster label. */ label[x,y] = largest_label; else if (left != 0) and (above == 0) then /* One neighbor, to the left. */ label[x,y] = find(left); else if (left == 0) and (above != 0) then /* One neighbor, above. */ label[x,y] = find(above); else /* Neighbors BOTH to the left and above. */ union(left,above); /* Link the left and above clusters. */ label[x,y] = find(left); } }
Union void union(int x, int y) { labels[find(x)] = find(y); }
Find int find(int x) { int y = x; while (labels[y] != y) y = labels[y]; while (labels[x] != x) { int z = labels[x]; labels[x] = y; x = z; } return y; }
Example
Consider the following example. The dark cells in the grid in figure (a)
represent that they are occupied and the white ones are empty. So by running H–K algorithm on this input we would get the output as shown in figure (b)
with all the clusters labeled.
The algorithm processes the input grid, cell by cell, as follows: Let's say that grid is a two-dimensional array.
- Starting from cell
grid[0][0]
i.e. the first cell. The cell is occupied and it doesn't have cells to the left or above so we will label this cell with a new label say1
. grid[0][1]
andgrid[0][2]
both are unoccupied so they are not labeled.grid[0][3]
is occupied so check cell to the left which is unoccupied so we increment the current label value and assign the label to the cell as2
.grid[0][4]
,grid[0][5]
andgrid[1][0]
are unoccupied so they are not labeled.grid[1][1]
is occupied so check cell to the left and above, both the cells are unoccupied so assign new label3
.grid[1][2]
is occupied so check cell to the left and above, only cell to the left is occupied so assign the label of cell on left to this cell3
.grid[1][3]
is occupied so check cell to the left and above, both the cells are occupied, so merge the two clusters and assign the cluster label of cell above to the cell on left and to this cell i.e.2
. (Merging using union algorithm will label all the cells with label3
to2
)grid[1][4]
is occupied so check cell to the left and above, only cell to the left is occupied so assign the label of cell on left to this cell2
.grid[1][5]
,grid[2][0]
andgrid[2][1]
are unoccupied so they are not labeled.grid[2][2]
is occupied so check cell to the left and above, only cell above is occupied so assign the label of cell above to this cell2
.grid[2][3]
,grid[2][4]
andgrid[2][5]
are unoccupied so they are not labeled.grid[3][0]
is occupied so check cell above which is unoccupied so we increment the current label value and assign the label to the cell as4
.grid[3][1]
is occupied so check cell to the left and above, only cell to the left is occupied so assign the label of cell on left to this cell4
.grid[3][2]
is unoccupied so it is not labeled.grid[3][3]
is occupied so check cell to the left and above, both the cells are unoccupied so assign new label5
.grid[3][4]
is occupied so check cell to the left and above, only cell to the left is occupied so assign the label of cell on left to this cell5
.grid[3][5]
,grid[4][0]
andgrid[4][1]
are unoccupied so they are not labeled.grid[4][2]
is occupied so check cell to the left and above, both the cells are unoccupied so assign new label6
.grid[4][3]
is occupied so check cell to the left and above, both, cell to the left and above are occupied so merge the two clusters and assign the cluster label of cell above to the cell on left and to this cell i.e.5
. (Merging using union algorithm will label all the cells with label6
to5
).grid[4][4]
is unoccupied so it is not labeled.grid[4][5]
is occupied so check cell to the left and above, both the cells are unoccupied so assign new label7
.grid[5][0]
,grid[5][1]
,grid[5][2]
andgrid[5][3]
are unoccupied so they are not labeled.grid[5][4]
is occupied so check cell to the left and above, both the cells are unoccupied so assign new label8
.grid[5][5]
is occupied so check cell to the left and above, both, cell to the left and above are occupied so merge the two clusters and assign the cluster label of cell above to the cell on left and to this cell i.e.7
. (Merging using union algorithm will label all the cells with label8
to7
).
|
|
Consider 6x6 grids in figure (a) and (b).Figure (a), This is the input to the Hoshen–Kopelman algorithm. Figure (b), This is the output of the algorithm with all the clusters labeled. |
Applications
- Segmentation and Clustering of a Binary Image[4]
- Determination of Nodal Domain Area and Nodal Line Lengths[5]
- Nodal Connectivity Information
- Modeling of electrical conduction.
More clustering algorithms
- K-means clustering algorithm
- Fuzzy clustering algorithm
- Gaussian (EM) clustering algorithm
References
- ^ https://www.cs.princeton.edu/~rs/AlgsDS07/01UnionFind.pdf
- ^ Hoshen, J.; Kopelman, R. (15 October 1976). "Percolation and cluster distribution. I. Cluster multiple labeling technique and critical concentration algorithm". Phys. Rev. B. 14 (8): 3438–3445. doi:10.1103/PhysRevB.14.3438 – via APS.
- ^ "The Hoshen-Kopelman Algorithm". Ocf.berkeley.edu. 2004-04-21. Retrieved 2016-09-17.
- ^ "Journal of Applied Sciences" (PDF). Scialert.net. ISSN 1812-5654. Retrieved 2016-09-17.
- ^ Christian Joas. "Introduction to the Hoshen-Kopelman algorithm and its application to nodal domain statistics" (PDF). Webhome.weizmann.ac.il. Retrieved 2016-09-17.