Jump to content

User:Normalrobot33/sandbox

From Wikipedia, the free encyclopedia

In the field of streaming algorithms, Misra–Gries summaries are used to solve the frequent elements problem in the data stream model. That is, given a long stream of input that can only be examined once (and in some arbitrary order), the Misra-Gries algorithm[1] can be used to compute which (if any) value makes up a majority of the stream, or more generally, the set of items that constitute some fixed fraction of the stream.

The term "summary" is due to Graham Cormode.[2][3] The algorithm was presented by Misra and Gries alongside a different algorithm for finding frequent elements, the Misra–Gries heavy hitters algorithm.

The Misra–Gries summary

[edit]

As for all algorithms in the data stream model, the input is a finite sequence of integers from a finite domain. The algorithm outputs an associative array which has values from the stream as keys, and estimates of their frequency as the corresponding values. It takes a parameter k which determines the size of the array, which impacts both the quality of the estimates and the amount of memory used.

algorithm misra-gries:[4]
    input: 
        A positive integer k
        A finite sequence s taking values in the range 1,2,...,m
    output: An associative array A with frequency estimates for each item in s
    
    A := new (empty) associative array
    while s is not empty:
        take a value i from s
        if i is in keys(A):
            A[i] := A[i] + 1
        else if |keys(A)| < k - 1:
            A[i] := 1
        else:
            for each K in keys(A):
                A[K] := A[K] - 1
                if A[K] = 0:
                    remove K from keys(A)
    return A

Error analysis

[edit]

Due to the required space constraints, when the algorithm reaches the limit of the size of the associated data structure , some frequencies may be reduced, resulting in the output frequency not matching the actual frequency . The specific range of the error is:

Upper bound

[edit]

Since each time an element appears in stream, the value of the key corresponding to this element will increase by at most one, we have:

Lower bound

[edit]

An intuitive explanation: Each time we perform the "minus one" operation, we will reduce the counters of the elements currently tracked in by , reducing a total of units. These units are gradually accumulated by our previous "increment one" operations. Therefore, each "minus one" can be regarded as undoing "increment one".

The stream has a total of elements, and at most "increment one" operations can occur. Therefore, at most "minus one" can occur. Hence we have the lower bound:

Properties

[edit]

The Misra–Gries algorithm uses O(k(log(m)+log(n))) space, where m is the number of distinct values in the stream and n is the length of the stream. The factor k accounts for the number of entries that are kept in the associative array A. Each entry consists of a value i and an associated counter c. The counter c can, in principle, take any value in {0,...,n}, which requires ⌈log(n+1)⌉ bits to store. Assuming that the values i are integers in {0,...,m-1}, storing them requires ⌈log(m)⌉ bits.

Every item which occurs more than n/k times is guaranteed to appear in the output array.[4] Therefore, in a second pass over the data, the exact frequencies for the k−1 items can be computed to solve the frequent items problem, or the majority problem. With the same arguments as above, this second pass also takes O(k(log(m)+log(n))) space.

The summaries (arrays) output by the algorithm are mergeable, in the sense that combining summaries of two streams s and r by adding their arrays keywise and then decrementing each counter in the resulting array until only k keys remain results in a summary of the same (or better) quality as compared to running the Misra-Gries algorithm over the concatenation of s with r.

Solve majority problem

[edit]

In practice, the majority problem can be solved by setting and running Misra-Gries Algorithm in two passes.(Strictly speaking, the second pass is not the Misra–Gries algorithm, but a normal counting pass. Beacuse the second pass accurately counts the elements from output of the first pass)

First, according to the lower bound of the error, , where is the majority and . Hence , if and only if .

Since it has been proved that , the majority must be in the result of the first pass output. However, we can not know which element is the majority in one pass because of the error.

In the second pass, counters are maintained only for elements in the output of the first pass, and count exactly the number of times when each element appears. If the frequency of an element is greater than , this element must be the majority.

Example

[edit]

Let k=2 and the data stream be 1,4,5,4,4,5,4,4 (n=8,m=5). Note that 4 is appearing 5 times in the data stream which is more than n/k=4 times and thus should appear as the output of the algorithm.

Since k=2 and |keys(A)|=k−1=1 the algorithm can only have one key with its corresponding value. The algorithm will then execute as follows(- signifies that no key is present):

Example Execution of Misra–Gries
Stream Value Key Value
1 1 1
4 0 (full, the value of key 1 is reduced by 1, key 1 is removed)
5 5 1
4 0 (full, the value of key 5 is reduced by 1, key 5 is removed)
4 4 1
5 0 (full, the value of key 4 is reduced by 1, key 4 is removed)
4 4 1
4 4 2

First pass output (Misra-Gries output): (key=4, estimated frequency=2)

In second pass, only one counter is maintained for key , which has a value of . Hence key is the majority.

Second pass output: 4

References

[edit]
  1. ^ Misra, J.; Gries, David (1982). "Finding repeated elements". Science of Computer Programming. 2 (2): 143–152. doi:10.1016/0167-6423(82)90012-0. hdl:1813/6345.
  2. ^ Cormode, Graham (2014). "Misra-Gries Summaries". In Kao, Ming-Yang (ed.). Encyclopedia of Algorithms. Springer US. pp. 1–5. doi:10.1007/978-3-642-27848-8_572-1. ISBN 9783642278488.
  3. ^ Graham Cormode, Misra-Gries Summaries
  4. ^ a b Cormode 2014.