没有合适的资源?快使用搜索试试~ 我知道了~
温馨提示
Prefix B-TreesRUDOLF BAYER and KARL UNTERAUERTechnische Universitiit MiinchenTwo modifications of B-trees are described, simple prefix B-trees and prefix B-trees. Both store only parts of keys, namely prefixes, in the index part of a B*-tree. In simple prefix B- trees those prefixes are selected carefully to minimize their length. In prefix B-trees the pre- fixes need not he fully stored, but are reconstructed as the tree is searched. Prefix B-trees are designed to combine some of the advant
资源推荐
资源详情
资源评论
Prefix B-Trees
RUDOLF BAYER and KARL UNTERAUER
Technische Universitiit Miinchen
Two modifications of B-trees are described, simple prefix B-trees and prefix B-trees. Both
store only parts of keys, namely prefixes, in the index part of a B*-tree. In simple prefix B-
trees those prefixes are selected carefully to minimize their length. In prefix B-trees the pre-
fixes need not he fully stored, but are reconstructed as the tree is searched. Prefix B-trees are
designed to combine some of the advantages of B-trees, digital search trees, and key compres-
sion techniques while reducing the processing overhead of compression techniques.
Key Words and Phrases: B-trees, key compression, multiway search trees
CR Categories: 3.73, 3.74
1. SIMPLE PREFIX B-TREES
We assume that the reader is familiar with B-trees [4, 81 and with a variation of
B-trees, called B*-trees [8, lo]. In B*-trees the records of a file together with the
keys identifying them are only stored in leaf nodes of the tree structure. We call
the nonleaves branch nodes or branch pages. Leaves can be linked to their neigh-
bors to allow sequential processing of the leaves without using the branch nodes of
the B*-tree.
We call the part
of a B*-tree consisting only of the branch nodes the B*-index
and the ordered set of leaves the B*-file. In the B*-index some keys, which appear
again in the B*-file with their associated records, are repeated without their records.
The following observation is important for the rest of this paper: The keys stored
in the B*-index are only used to direct the search algorithm and to determine in
which subtree of a given branch node a key and its associated record will be found,
if they are in the tree at all.
It is now a fairly obvious observation that we need not necessarily use the keys
in
the
B*-file
to construct the B*-index. Instead we can use other strings, con-
structed to have desired properties, for building up the equivalent of the B*-index
of a B*-tree.
To give a simple example, we assume that a leaf is already full and contains
keys
Copyright @ 1977, Association for Computing Machinery, Inc. General permission to re-
publish, but not for profit, all or part of this material is granted provided that ACM’s copy-
right notice is given and that reference is made to the publication, to its date of issue, and
to the fact that reprinting privileges were granted by permission of the Association for Corn:
puting Machinery.
Most of the research reported in this paper was performed while the first author was visiting
at IBM&search Laboratory, San Jose, CA 95193. The research of the second author was
sup-
ported by the Sonderforschungsbereich 49-Elektronische Rechenanlagen und Informa-
tionsverarbeitung-of the Deutsche Forschungsgemeinschaft.
Authors’ address: Institut fiir Informatik
der Technischen Universitiit Munchen, Arcisstrasse
21, D-3999 Mtinchen 2, West Germany.
ACM Transactions on Database Systems, Vol. 2, No. 1, March 1977, Pages 11-26.
12
l
R. Bayer and K. Unterauer
Bigbird, Burt, Cookiemonster, Ernie, Snuffleopogus
In order to insert the key “Grouch” with its record, we must split this leaf into
two as follows:
Bigbird, Burt, Cookiemonster
Ernie, Grouch, Snuffleopogus
Instead of storing the key “Ernie” in the index, it obviously suffices to use one of
the one-letter strings “D”, “E” for the same purpose. In general we can select
any string s with the property
Cookiemonster < s 5 Ernie
(1)
and store it in the index part to separate the two nodes. We call such a string s
a separator (between Cookiemonster and Ernie). It seems prudent to choose one
of the shortest separators.
Note. If the keys are words over some alphabet and the ordering of the keys is
the alphabetic order, then the following property, called the pre$x property, holds:
Let x and y be any two keys such that x < y. Then there is a unique prefix g of
y such that (a) g is a separator between x and y, and (b) no other separator
between x and y is shorter than g. For the rest of this paper, we assume that the
prefix property holds.
The technique of moving a shortest separator to the father node when a node is
being split can be used only for splitting leaves, not branch nodes. When a branch
node is being split, one of the separators on that node must be moved up one
level in the tree.
As mentioned before, a B*-tree can be considered as consisting of a B*-index and
a B*-file. The B*-index itself is just a conventional B-tree of a subset of the keys
in the B*-file together with the maintenance algorithms for B-trees described in [4].
DeJinition. A simple prefix B-tree is a B*-tree in which the B*-index is replaced
by a B-tree of (variable length) separators.
Note. Since a key in a B*-index is also a separator, although not necessarily a
shortest possible separator, the class of simple prefix B-trees contains the class of
B*-trees.
Except for the slight complication of always having variable length separators,
the search algorithm for simple prefix B-trees is exactly the same as for B*-trees.
Split interval. The performance bottleneck of our trees is the number of ac-
cesses to the backup store needed for INSERT, DELETE, and RETRIEVE
operations. This number is essentially determined by the height of the tree since
the pages along the retrieval path for some key x from the root to some leaf are
always needed for those three operations.
Performance can therefore be improved by making the trees as flat as possible,
which can be achieved by making the branching degree of the nodes, especially
in the upper parts of the tree (i.e. near the root), as high as possible. This branch-
ing degree is determined by the number of (separator, pointer) pairs that can be
stored on a fixed size page. Pointers are generally rather short and have a fixed
ACM Transactions on Database Systems, Vol. 2, No. 1, March 1977.
Prefix B-Trees
l
13
size of a few bytes. Keys, however, tend to be rather long. The branching degree
can therefore be increased by decreasing the length of the separators. This essen-
tially is the rationale for storing only shortest separators instead of full keys in
the index part of a simple prefix B-tree.
This idea can now be carried one step further if we do not insist on splitting a
leaf precisely in the middle. Instead we could allow a certain split interval around
the middle (or the median key) of a splitting leaf within which a split point (be-
tween two adjacent keys xi-l and xi) should be chosen so as to minimize the length
of the shortest separator si . Note that because of the prefix property the resulting
separator si can still be chosen as a prefix of the key zi . The size of the split inter-
val is determined by a parameter UZ .
gz is simply the number of separators (or
bytes) around the middle of the page which we consider for choosing a suitable
split point.
The same idea can be applied to splitting branch nodes. A certain interval of
size
Ub
around the middle of a page is considered for choosing a split point such
that a shortest separator within this interval is moved to the father page.
Effect of ~1 and Ub . An increase of uz should decrease the average length of the
separators in the index part of the tree, thereby reducing the number of nodes in
the index part. An increase of Ug should favor the shorter separators in the index to
be located near the root, thereby increasing the branching degree of nodes near the
root, where a high branching degree is most beneficial.
Increasing both (rl and C7b causes two effects working against each other:
(a) It tends to decrease the height of the index part for the reasons just de-
scribed.
(b) The storage utilization decreases (there can now be pages less than half
full), which requires more pages in the file part and more but shorter entries in
the index part of a tree.
We have not analyzed the influence of uz or
Ub
on the performance of the trees.
We expect such an analysis to be quite involved and difficult. We are quite con-
fident, however, that small split intervals improve performance considerably.
Sets of keys that arise in practical applications are often far from random, and
clusters of similar keys differing only in the last few letters (e.g. plural forms) are
quite common.
As an example, consider the key sequence
“On, Part, Problem, Problems, Solution, Solutions”
arising in the tree of Figure 1. Splitting this sequence in the rr-ddle between the
third and fourth key would yield “Problems” as the shortest separator. Allowing a
split point to be chosen one key to the left or to the right yields “Pr” or “S’ as
separators. The split point between “Problems” and ‘Solution” yields the shortest
separator “S”,
which is a prefix of the full key “Solution” and appears in the tree
of Figure 1.
2. ALGORITHMS FOR SIMPLE PREFIX B-TREES
2.1 Search Algorithm
This algorithm is the same as for B*-trees with variable length keys.
ACM Transactions on Datsbsse Systems, Vol. 2, No. 1, March 1977.
剩余15页未读,继续阅读
资源评论
weixin_38740397
- 粉丝: 6
- 资源: 854
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功