没有合适的资源?快使用搜索试试~ 我知道了~
温馨提示
LBNL-627561Bitmap Index Design Choices and Their Performance ImplicationsElizabeth O’Neil and Patrick O’NeilUniversity of Massachusetts at Boston{eoneil, poneil}@cs.umb.eduKesheng WuLawrence Berkeley National Laboratorykwu@lbl.govABSTRACTHistorically, bitmap indexing has provided an important database capability to accelerate queries.However, only a few database systems have implemented these indexes because of the difficulties ofmodifying fundamental assumptions in the low-level design of a da
资源推荐
资源详情
资源评论
LBNL-62756
1
Bitmap Index Design Choices and Their Performance
Implications
Elizabeth O’Neil and Patrick O’Neil
University of Massachusetts at Boston
{eoneil, poneil}@cs.umb.edu
Kesheng Wu
Lawrence Berkeley National Laboratory
kwu@lbl.gov
ABSTRACT
Historically, bitmap indexing has provided an important database capability to accelerate queries.
However, only a few database systems have implemented these indexes because of the difficulties of
modifying fundamental assumptions in the low-level design of a database system and in the expectations
of customers, both of which have developed in an environment that does not support bitmap indexes.
Another problem that arises, and one that may more easily be addressed by a research article, is that there
is no definitive design for bitmap indexes; bitmap index designs in Oracle, Sybase IQ, Vertica and
MODEL 204 are idiosyncratic, and some of them were designed for older machine architectures.
To investigate an efficient design on modern processors, this paper provides details of the Set Query
benchmark and a comparison of two research implementations of bitmap indexes. One, called RIDBit,
uses the N-ary storage model to organize table rows, and implements a strategy that gracefully switches
between the well-known B-tree RID-list structure and a bitmap structure. The other, called FastBit is
based on vertical organization of the table data, where all columns are individually stored. It implements a
compressed bitmap index, with a linear organization of the bitmaps to optimize disk accesses. Through
this comparison, we evaluate the pros and cons of various design choices. Our analysis adds a number of
subtleties to the conventional indexing wisdom commonly quoted in the database community.
1. INTRODUCTION
Bitmap indexes have not seen much new adoption in commercial database systems in recent years. While
ORACLE has offered bitmap indexing since 1995, other major systems such as DB2 and Microsoft SQL
Server do not provide them. Microsoft SQL Server may create bitmaps during hash joins, but not for
general indexing; DB2 has adopted an Encoded Vector Index [16], but this is basically an encoded
projection index rather than a bitmap index. Sybase Adaptive Server Enterprise (ASE), the major Sybase
DBMS, does not have bitmap indexing, although the Sybase Adaptive Server IQ product provides quite
competitive bitmap indexing for data warehousing. This situation arises in part because there is no
definitive design for bitmap indexes. To investigate such a definitive design, we plan to explore different
design choices through a careful study of two research implementations. Since we have control over all
aspects of the research implementations, we are able to try out some new
techniques for improving performances, such as new forms of compression and
careful disk placement. In the process of studying their performance pros and
cons, we also find some surprises.
A basic bitmap index (more simply, bitmap index in what follows) is typically
used to index values of a single column X in a table. This index consists of an
ordered sequence of keyvalues representing distinct values of the column, and
each keyvalue is associated with a bitmap that specifies the set of rows in the
table for which the column X has that value. A bitmap has as many bits as the
number of rows in the table, and the kth bit in the bitmap is set to 1 if the value
of column X in the kth row is equal to the keyvalue associated with the bitmap,
and 0 for any other column value. Table 1 shows a basic bitmap index on a
table with nine rows, where the column X to be indexed has integer values
ranging from 0 to 3. We say that the column cardinality of X is 4 because it has
4 distinct values. The bitmap index for X contains 4 bitmaps, shown as B
0
, B
1
,
Table 1: A bitmap
index for a column
named X. Columns
B
0
– B
3
are
bitmaps.
RID
X
B
0
B
1
B
2
B
3
0 2
0
0
1
0
1 1
0
1
0
0
2 3
0
0
0
1
3 0
1
0
0
0
4 3
0
0
0
1
5 1
0
1
0
0
6 0
1
0
0
0
7 0
1
0
0
0
8 2
0
0
1
0
LBNL-62756
2
…, B
3
, with subscripts corresponding to the value represented. In Table 1 the second bit of B
1
is 1
because the second row of X has the value 1, while corresponding bits of B
0
, B
2
and B
3
are all 0.
To answer a query such as “X > 1,” we perform bitwise OR (|) operations between successive long-words
of B
2
and B
3
, resulting in a new bitmap that can take part in additional operations. Since bitwise logical
operations such as OR (|), AND (&) and NOT (~) are well-supported by computer hardware, a bitmap
index software could evaluate SQL predicates extremely quickly. Because of this efficiency, even some
DBMS systems that don’t support bitmap indexes will convert intermediate solutions to bitmaps for some
operations. For example, PostgreSQL 8.1.5 has no bitmap index, but uses bitmaps to combine some
intermediate solutions [10]. Similarly, Microsoft SQL Server has a bitmap operator for filtering out rows
that don’t participate in a join operation [5].
Let N denote the number of rows in the table T and C(X) the cardinality of column X. It is easy to see that
a basic bitmap index like the one in Table 1 requires N•C(X) bits in the bitmaps. In the worst case where
every column value is distinct, so that C(X) = N, such a bitmap index requires N
2
bits. For a large dataset
with many millions of rows, such an index would be much larger than the table being indexed. For this
reason, much of the research on bitmap indexes has focused on compressing bitmaps to minimize index
sizes. However, operations on compressed bitmaps are often slower than on uncompressed ones, called
verbatim bitmaps. There is a delicate balance between reducing index size and reducing query response
time, which complicates the design considerations for bitmap index implementations.
Two very different approaches to reducing index sizes are used by the research prototypes we study.
FastBit implements the Word-Aligned Hybrid (WAH) compression; the WAH compressed basic bitmap
index was shown to be efficient in [18][19]. RIDBit employs a combination of verbatim bitmaps and
RID-lists composed of compact (two-byte) Row Identifiers (RIDS). Its unique ability to gracefully switch
from verbatim bitmaps to RID-lists based on the column cardinality originated with MODEL 204 [6].
The implementations of FastBit and RIDBit were quite different at the beginning of our study, which
made them ideal for contrasting the different implementation strategies and physical design choices. As
our study progressed, a number of implementation ideas found to be superior in FastBit were copied in
RIDBit; RIDBit software was also modified to better utilize the CPU. The lessons learned in this exercise
will be covered in the Summary section. The following table gives the key differences between RIDBit
and FastBit. We will discuss the detailed design of the two approaches in the next two sections.
FastBit RIDBit
Table layout Vertical storage (columns stored separately) N-ary storage (columns stored together in row)
Index layout Arrays of bitmaps B-tree keyed on keyvalues (improved in project)
Bitmap layout Continuous Horizontally partitioned into 32K-bit Segments
Compression Word-Aligned Hybrid compression Sparse bitmap converted to RID-list
The topics covered in succeeding sections are as follows. In Sections 2 and 3, we describe the architecture
of FastBit and RIDBit. Section 4 provides a theoretical analysis of index sizes for different columns.
Section 5 describes the Set Query Benchmark [7], which is used to compare performance of RIDBit and
FastBit. Section 6 presents the detailed experimental measurements. Finally, Section 7 provides a
summary and lessons learned.
2. FASTBIT
FastBit started out as a research tool for studying how compression methods affect bitmap indexes, and
has been shown since to be an efficient access method in a number of scientific applications [12][17]. It
organizes data into tables (with rows and columns), where each table is vertically partitioned and different
columns stored in separate files. Very large tables are also horizontal partitioned, each partition typically
consisting of many millions of rows. A partition is organized as a directory, with a file containing the
schema, followed by the data files for each column. This vertical data organization is similar to a number
of contemporary database systems such as Sybase IQ [9], MonetDB [1][2], Kx systems [4], and C-Store
LBNL-62756
3
[14]. Each column is effectively a projection index as defined in [9] and can sometimes be used
efficiently to answer queries without additional indexing structures. FastBit currently indexes only fixed-
sized columns, such as integers and floating-point numbers, although it can index low-cardinality string-
valued columns through a dictionary that converts the strings to integers. Because of this restriction, the
mapping from a row to a row identifier is
straightforward.
FastBit implements a number of different bitmap
indexes with various binning, encoding and
compression strategies [13]. The index used in this
study is the WAH compressed basic bitmap index.
All bitmaps of an index are stored in a single file
as shown in Table 2. Logically, an index file
contains two sets of values: the keyvalues and the
compressed bitmaps. FastBit stores both the keyvalues and bitmaps in arrays on disk. Since each keyvalue
is the same size, it can be located easily. To locate the bitmaps, FastBit stores another array starts[] to
record the starting position of all compressed bitmaps in the index file (in bitmaps[ ]). To simplify the
software, one extra values is used in array starts[] to record the ending position of the last bitmap.
FastBit generates all bitmaps of an entire index for one partition in memory before writing the index file.
This dictates that the entire index must fit in memory and imposes an upper bound on how many rows a
horizontal partition can hold on a given computer system. Typically, a partition has no more than 100
million rows, so that a small number of bitmap indexes may be built in-memory at once.
In general, FastBit store the array keyvalues[ ] in ascending order so that it can efficiently locate any
particular value. In some cases, it is possible to replace this array with a hash function. Using hash
functions typically requires fewer I/O operations to answer a query than using arrays do, but using arrays
more easily accommodates arbitrary keyvalues. FastBit uses memory maps to access the array
keyvalues[] and starts[] if the OS supports it; otherwise it reads the two arrays entirely into memory.
Since the index for a partition has to fit in memory when built, this reading procedure does not impose
any additional constraint on the sizes of the partitions.
One advantage of the linear layout of the bitmaps is that it minimizes the number of I/O operations when
answering a query. For example, to answer the range predicate “3 < KN < 10”, FastBit needs to access
bitmaps for values 4 through 9. Since these bitmaps are laid out consecutively in the index file, FastBit
reads all these bitmaps in one sequential read operation.
The linear layout of bitmaps means that FastBit is not in any way optimized for update. An update that
might add or subtract a 1-bit to one of the bitmaps would require modification of the bitmap, followed by
a reorganization of all successive bitmaps in the set. In scientific applications, changes in real time
between queries are unusual, so this limitation is not a serious drawback, and it is not a problem for most
commercial data warehousing applications either. Furthermore, we will see in Section 7 that the new
Vertica database product [14] provides a model where a fixed index for stable data can be maintained on
disk while new data is inserted to a memory resident dynamic store that takes part in all queries.
FastBit reconstitutes a C++ bitmap data structure from the bytes read into memory. This step makes it
easy to use the bitwise logical operation functions implemented in C++; however, it introduces
unnecessary overhead by invoking the C++ constructor and destructor. Additionally, since FastBit
aggregates the read operations for many bitmaps together, a certain amount of memory management is
required to produce the C++ bitmap objects.
3. RIDBIT
RIDBit was developed as a pedagogical exercise for an advance database internals course, to illustrate
how a bitmap indexing capability could be developed. The RIDBit architecture is based on index design
first developed for the Model 204 Database product from Computer Corporation of America [6]. We can
Table 2: Content of an index file.
N Number of rows
C Column cardinality
keyvalues[C]
Distinct values associate with each bitmap
starts[C+1] Starting position of each compressed bit-
map (final position is end of all bitmaps)
bitmaps[C] WAH Compressed bitmaps
剩余14页未读,继续阅读
资源评论
weixin_38666823
- 粉丝: 5
- 资源: 971
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- 国际象棋检测2-YOLO(v5至v11)、COCO、CreateML、Paligemma、TFRecord、VOC数据集合集.rar
- ssd5课件图片记录保存
- 常用算法介绍与学习资源汇总
- Python与Pygame实现带特效的圣诞节场景模拟程序
- 国际象棋检测11-YOLO(v7至v9)、COCO、Darknet、Paligemma、VOC数据集合集.rar
- 使用Python和matplotlib库绘制爱心图形的技术教程
- Java外卖项目(瑞吉外卖项目的扩展)
- 必应图片壁纸Python爬虫代码bing-img.zip
- 基于Pygame库实现新年烟花效果的Python代码
- 浪漫节日代码 - 爱心代码、圣诞树代码
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功