Flajolet–Martin algorithm

The Flajolet–Martin algorithm approximates the number of distinct elements in a stream in a single pass, using space that is logarithmic in the maximum number of possible distinct elements in the stream. It was introduced by Philippe Flajolet and G. Nigel Martin in their 1984 paper "Probabilistic Counting Algorithms for Data Base Applications".^[1] It was later refined in "LogLog counting of large cardinalities" by Marianne Durand and Philippe Flajolet,^[2] and in "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm" by Philippe Flajolet et al.^[3]

In their 2010 paper "An optimal algorithm for the distinct elements problem",^[4] Daniel M. Kane, Jelani Nelson and David P. Woodruff give an improved algorithm which uses nearly optimal space and has optimal O(1) update and reporting times.

The algorithm

Assume that we are given a hash function $hash(x)$ which maps input $x$ to integers in the range $[0;2^{L}-1]$ , and whose outputs are sufficiently uniformly distributed. Note that the set of integers from 0 to $2^{L}-1$ corresponds to the set of binary strings of length $L$ . For any non-negative integer $y$ , define $bit(y,k)$ to be the $k$ -th bit in the binary representation of $y$ , such that:

$y=\sum _{k\geq 0}{\text{bit}}(y,k)2^{k}$

We then define a function $\rho (y)$ which outputs the position of the least significant 1-bit in the binary representation of $y$ :

$\rho (y)=\min\{k\geq 0:{\text{bit}}(y,k)\neq 0\}$

where $\rho (0)=L$ . Note that with the above definition we are using 0-indexing for the positions. For example, $\rho (13)=\rho (1101)=0$ since the least significant bit is a 1, and $\rho (8)=\rho (1000)=3$ since the least significant 1-bit is at the fourth position. Under the assumption that the output of our hash function is uniformly distributed, the probability of observing a hash output ending with $10^{k}$ (a one, followed by $k$ zeroes) is $2^{-(k+1)}$ , since this corresponds to flipping $k$ heads and then a tail with a fair coin.
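The definition of $\rho$ can be illustrated with a short Python sketch. The function name `rho` and the default value $L=32$ are assumptions made for this example only.

```python
def rho(y: int, L: int = 32) -> int:
    """0-indexed position of the least significant 1-bit of y, with rho(0) = L."""
    if y == 0:
        return L
    # y & -y isolates the lowest set bit; bit_length() - 1 is its 0-indexed position.
    return (y & -y).bit_length() - 1


print(rho(13))  # 13 is 1101 in binary -> 0
print(rho(8))   # 8 is 1000 in binary -> 3
```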

Now the Flajolet-Martin algorithm for estimating the cardinality of a multiset $M$ is as follows:

Initialize a bit-vector BITMAP to be of length $L$ and contain all 0's.
For each element $x$ in $M$ :
1. index = $\rho ({\text{hash}}(x))$ .
2. $BITMAP[index]=1$ .
Let $R$ denote the smallest index $i$ such that $BITMAP[i]=0$ .
Estimate the cardinality of $M$ as $2^{R}/\phi$ where $\phi \approx 0.77351$ .
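The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the helper `hash32` (the first four bytes of a SHA-256 digest), the `seed` parameter, and the choice $L=32$ are assumptions made for the example, and the estimate is computed as $2^{R}/\phi$. A hash value of 0 gives $\rho = L$, which falls outside the bitmap, so the sketch simply skips it.

```python
import hashlib

PHI = 0.77351  # correction factor quoted in the text


def hash32(x, seed: int = 0) -> int:
    """Hypothetical stand-in hash: first 4 bytes of SHA-256 as an integer in [0, 2^32 - 1]."""
    digest = hashlib.sha256(f"{seed}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rho(y: int, L: int) -> int:
    """0-indexed position of the least significant 1-bit of y, with rho(0) = L."""
    return L if y == 0 else (y & -y).bit_length() - 1


def fm_estimate(stream, L: int = 32, seed: int = 0) -> float:
    """Single-pass estimate of the number of distinct elements in stream."""
    bitmap = [0] * L
    for x in stream:
        index = rho(hash32(x, seed), L)
        if index < L:  # rho(0) = L has no slot in the bitmap (assumption: ignore it)
            bitmap[index] = 1
    # R is the smallest index whose bit is still 0 (L if every bit is set).
    R = next((i for i, b in enumerate(bitmap) if b == 0), L)
    return 2 ** R / PHI


stream = [f"user{i % 1000}" for i in range(5000)]  # 1000 distinct values
print(fm_estimate(stream))
```

Because repeated elements hash to the same value, duplicates never change the bitmap. A single run can still be off by a factor of two or more, since $R$ is an integer.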

The idea is that if $n$ is the number of distinct elements in the multiset $M$ , then $BITMAP[0]$ is accessed approximately $n/2$ times, $BITMAP[1]$ is accessed approximately $n/4$ times and so on. Consequently, if $i\gg \log _{2}n$ then $BITMAP[i]$ is almost certainly 0, and if $i\ll \log _{2}n$ then $BITMAP[i]$ is almost certainly 1. If $i\approx \log _{2}n$ then $BITMAP[i]$ can be expected to be either 1 or 0.
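As a rough quantitative check, assume (as an idealization for this illustration) that the $n$ distinct elements receive independent, uniformly distributed hash values. Each distinct element then sets $BITMAP[i]$ with probability $2^{-(i+1)}$ , so

$\Pr[BITMAP[i]=0]=\left(1-2^{-(i+1)}\right)^{n}\approx e^{-n/2^{i+1}}$

For $n=1024$ (so $\log _{2}n=10$ ), this gives roughly $10^{-7}$ for $i=5$ , about $0.37$ for $i=9$ , and about $0.98$ for $i=15$ .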

The correction factor $\phi \approx 0.77351$ is derived in the original paper.

Improving accuracy

A problem with the Flajolet–Martin algorithm in the above form is that the results vary a lot. A common solution is to run the algorithm multiple times with $k$ different hash functions and combine the results from the different runs. One idea is to take the mean of the $k$ results, one from each hash function, obtaining a single estimate of the cardinality. The problem with this is that averaging is very susceptible to outliers (which are likely here). A different idea is to use the median, which is less prone to being influenced by outliers. The problem with this is that the results can only take the form $2^{R}/\phi$ , where $R$ is an integer. A common solution is to combine both the mean and the median: create $k\cdot \ell$ hash functions and split them into $k$ distinct groups (each of size $\ell$ ). Within each group, use the median to aggregate the $\ell$ results, and finally take the mean of the $k$ group estimates as the final estimate.
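The combination step can be sketched as below. The function name `combine_estimates` and the input values are hypothetical; in practice each input would be the estimate produced by one of the $k\cdot \ell$ hash functions.

```python
from statistics import mean, median


def combine_estimates(estimates, k: int, ell: int) -> float:
    """Median within each of k groups of size ell, then the mean of the k medians."""
    if len(estimates) != k * ell:
        raise ValueError("expected exactly k * ell estimates")
    # Consecutive slices of length ell form the k groups.
    groups = [estimates[g * ell:(g + 1) * ell] for g in range(k)]
    return mean([median(group) for group in groups])


# Illustrative values only: two groups of three estimates, one containing an outlier.
print(combine_estimates([2, 4, 4, 1024, 8, 8], k=2, ell=3))
```

In this example the group medians are 4 and 8, so the outlier 1024 does not affect the result, and the final mean of 6 is a value that no single run could have produced.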

References

[1] Flajolet, Philippe; Martin, G. Nigel. "Probabilistic Counting Algorithms for Data Base Applications". doi:10.1016/0022-0000(85)90041-8.
[2] Durand, Marianne; Flajolet, Philippe. "LogLog counting of large cardinalities". doi:10.1007/978-3-540-39658-1_55.
[3] Flajolet, Philippe; et al. "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm". doi:10.1.1.214.4277.
[4] Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. "An optimal algorithm for the distinct elements problem". doi:10.1145/1807085.1807094.

Additional sources

Rajaraman, Anand; Ullman, Jeffrey David (2011-10-27). Mining of Massive Datasets. Cambridge University Press. pp. 119–. ISBN 9781139505345. Retrieved 9 November 2014.
