Jump to content

Bit array

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Dcoetzee (talk | contribs) at 05:58, 20 November 2004 (Wrote article, intro, basic operations, complex operations related to priority queues, C/C++/Java language support). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
(diff) ← Previous revision | Latest revision (diff) | Newer revision → (diff)

A bit array is an array data structure which compactly stores individual bits (boolean values). It implements a simple set data structure storing a subset of {1,2,...,n} and is effective at exploiting bit-level parallelism in hardware to perform operations quickly. A typical bit array stores kw bits, where w is the number of bits in the unit of storage, such as a byte or word, and k is some integer. If the number of bits to be stored does not divide w, some space is wasted.

Basic operations

Although most machines are not able to address individual bits in memory, nor have instructions to manipulate single bits, each bit in a word can be singled out and manipulated using bitwise operations. In particular:

  • OR can be used to set a bit to one: 11101010 OR 00000100 = 11101110
  • AND and NOT can be used to set a bit to zero: 11101010 AND (NOT 00000010) = 11101000
  • AND together with zero-testing can be used to determine if a bit is set:
    • 11101010 AND 00010000 = 00000000 = 0
    • 11101010 AND 00000010 = 00000010 ≠ 0
  • XOR can be used to invert or toggle a bit:
    • 11101010 XOR 00000100 = 11101110
    • 11101110 XOR 00000100 = 11101010

We can view a bit array as a subset of {1,2,...,n}, where a 1 bit indicates a number in the set and a 0 bit a number not in the set, which uses n/w spaces. Whether the least significant bit or the most significant bit indicates the smallest-index number is largely irrelevent, but the former tends to be preferred.

Given two bit arrays of the same size, we can compute their union, intersection, and set difference using n/w simple bit operations each (2n/w for difference), as well as the complement of either:

Template:Wikicode

 for i from 0 to n/w-1
     complement_a[i] := NOT a[i]
     union[i]        := a[i] OR b[i]
     intersection[i] := a[i] AND b[i]
     difference[i]   := a[i] AND (NOT b[i])

If we wish to iterate through the bits of a bit array, we can do this efficiently using a doubly-nested loop which loops through each word, one at a time. Only n/w memory accesses are required:

 for i from 0 to n/w-1
     index := 0    // if needed
     word := a[i]
     for b from 0 to w-1
         value := word AND 1 ≠ 0
         word := word shift right 1
         // do something with value
         index := index + 1   // if needed

More complex operations

If we wish to find the number of 1 bits in a bit array, there are efficient branch-free algorithms which can compute the number of bits in a word using a series of simple bit operations. We simply run such an algorithm on each word and keep a running total. Counting zeros is similar.

Similarly, sorting a bit array is trivial to do in O(n) time using bucket sort — we count the number of ones k, fill the last k/w words with ones, set only the low k mod w bits of the next word, and set the rest to zero.

Bit arrays are useful in some contexts as priority queues. The goal in such a context is to identify the one bit of smallest index. Some machines have a find first one or find first zero operation that does this on a single word. With this, the operation is obvious: find the first nonzero word and run find first one on it, or find first zero on its complement. When this instruction is not available, there are other tricky sequences of bit operations that can accomplish the same thing effectively, if not quite as quickly.

Advantages and disadvantages

Bit arrays, despite their simplicity, have a number of marked advantages over other data structures for the same problems:

  • They are extremely compact; and few other data structure can store n independent pieces of data in n/w words.
  • They allow small arrays of bits to be stored and manipulated in the register set for long periods of time with no memory accesses.
  • Because of their ability to exploit bit-level parallelism, limit memory access, and maximally utilize the data cache, they often outperform many other data structures on practical data sets, even those which are more efficient asymptotically.

However, bit arrays aren't the solution to everything. In particular:

  • They are wasteful set data structures for sparse sets, those with few elements compared to their range, in both time and space. For such applications, Judy arrays, tries, van Emde Boas trees, or even Bloom filters should be considered instead.
  • Accessing individual elements can be expensive and difficult to express in some languages. If random access is more common than sequential and the array is relatively small, a byte array may be preferable on a machine with byte addressing.

Language support

The C programming language's bitfields, pseudo-objects found in structs with size equal to some number of bits, are in fact small bit arrays; they are limited in that they cannot span words. Although they give a convenient syntax, the bits are still accessed using bitwise operators on most machines, and they can only be defined statically (like C's static arrays, their sizes are fixed at compile-time). It is also a common idiom for C programmers to use words as small bit arrays and access bits of them using bit operators.

In C++, although individual bools typically occupy a byte, the STL type vector<bool> is a partial specialization which compresses bits into words. Another, very unique STL class, bitset, creates a vector of bits fixed at a particular size at compile-time, and which in its interface and syntax more resembles the idiomatic use of words as bit sets by C programmers. It also has some additional power, such as the ability to efficiently count the number of bits that are set.

In Java, the class java.util.BitSet, very much like C++'s bitset, creates a fixed-size bit array which is then manipulated with functions named after bitwise operators familiar to C programmers.