2 D. LEMIRE, G. SSI-YAN-KAI, O. KASER
One popular approach has been to compress bitmaps with run-length encoding. Effectively,
instead of using dn/W e words of W-bits for all bitmaps, we look for runs of consecutive words
containing only ones or only zeros, and we replace them with markers that indicate which value
is being repeated, and how many repetitions there are. This approach was first put into practice by
Oracle’s BBC [3], using 8-bit words. Starting with WAH [4], there have been many variations on
this idea, including Concise [2], EWAH [5], COMPAX [6], and so on. (See § 2.)
When processing such RLE-compressed formats, one may need to read every compressed word
to determine whether a value is present in the set. Moreover, computing the intersection or union
between two bitmaps B
1
and B
2
has complexity O(|B
1
| + |B
2
|) where |B
1
| and |B
2
| are the
compressed sizes of the bitmaps. This complexity is worse than that of a hash set, where we can
compute an intersection with an expected-time complexity of O(min(|S
1
|, |S
2
|)) where |S
1
| and
|S
2
| are the cardinalities of the sets. Indeed, it suffices to iterate over the smallest sets, and for each
value, check whether it is present in the larger set. Similarly, we can compute an in-place union,
where the result is stored in the largest hash set, simply by inserting all of the values from the
small set in the large set, in expected time O(min(|S
1
|, |S
2
|)). It is comparatively more difficult
to compute in-place unions with RLE-compressed bitmaps such as Concise or WAH. Moreover,
checking whether a given value is present in an RLE-compressed bitmap like Concise or WAH may
require a complete scan of the entire bitmap in O(|B|) time. Such a scan can be hundreds of times
slower than checking for the presence of a value in an uncompressed bitmap or hash map.
For better performance, Chambi et al. [7] proposed the Roaring bitmap format, and made
it available as an open-source library.
†
Roaring partitions the space [0, n) into chunks of
2
16
integers ([0, 2
16
), [2
16
, 2 × 2
16
), . . .). Each set value is stored in a container corresponding to
its chunk. Roaring stores dense and sparse chunks differently. Dense chunks (containing more than
4096 integers) are stored using conventional bitmap containers (made of 2
16
bits or 8 kB) whereas
sparse chunks use smaller containers made of packed sorted arrays of 16-bit integers. All integers in
a chunk share the same 16 most-significant bits. The containers are stored in an array along with the
most-significant bits. Though we refer to a Roaring bitmap as a bitmap, it is a hybrid data structure,
combining uncompressed bitmaps with sorted arrays.
Roaring allows fast random access. To check for the presence of a 32-bit integer x, we seek
a container corresponding to the 16 most significant bits of x, using binary search. If a bitmap
container is found, we check the corresponding bit (at index x mod 2
16
); if an array container is
found, we use a binary search. Similarly, we can compute the intersection between two Roaring
bitmaps without having to access all of the data. Indeed, suppose that we have a Roaring bitmap
B
1
containing few values in the range [0, 2
16
),which implies it uses an array container. Suppose
we have another Roaring bitmap B
2
over the same range containing many values; it can only use a
bitmap container. In that case, computing the intersection between B
1
and B
2
can be done in time
O(|B
1
|), since it suffices to iterate over the set values of B
1
and check whether the corresponding bit
is set in the bitmap container of B
2
. Moreover, to compute the in-place union of B
1
and B
2
, where
the result is stored in the large bitmaps (B
2
), it suffices to iterate through the values of B
1
and set
the corresponding bits in B
2
to 1, in time O(|B
1
|). Checking whether a value is present in a Roaring
bitmap is relatively fast. It suffices to locate the corresponding container (by a binary search). If the
container is a bitmap, a simple bit-value check suffices, otherwise another binary search through a
packed array provides the desired answer.
Fast intersection, union and difference operations are made possible through fast operations
between containers, even when these containers have different types (i.e., bitmap-vs-bitmap, array-
vs-array, array-vs-bitmap, bitmap-vs-array). Though conceptually simple, these operations between
containers must produce new containers that are either arrays or bitmap containers. Because
converting between container types is potentially expensive, we found it useful to predict the
container type as part of the computation. For example, if we must compute the union between
two array containers such that the sum of their cardinalities exceed 4096, we preemptively create a
†
https://github.com/lemire/RoaringBitmap
评论0
最新资源