没有合适的资源?快使用搜索试试~ 我知道了~
Access Path Selection in a Relational Database Management System
需积分: 0 33 下载量 184 浏览量
2015-03-16
06:13:40
上传
评论
收藏 218KB PDF 举报
温馨提示
试读
12页
In a high level query and data manipulation language such as SQL, requests are stated non-procedurally, without reference to access paths. This paper describes how System R chooses access paths for both simple (single relation) and complex queries (such as joins), given a user specification of desired data as a boolean expression of predicates. System R is an experimental database management system developed to carry out research on the relational model of data. System R was designed and built by members of the IBM San Jose Research Laboratory.
资源推荐
资源详情
资源评论
23
Access Path Selection
in a Relational Database Management System
P. Griffiths Selinger
M. M. Astrahan
D. D. Chamberlin
R. A. Lorie
T. G. Price
IBM Research Division, San Jose, California 95193
ABSTRACT: In a high level query and data
manipulation language such as SQL, requests
are stated non-procedurally, without refer-
ence to access paths. This paper describes
how System R chooses access paths for both
simple (single relation) and complex que-
ries (such as joins), given a user specifi-
cation of desired data as a boolean
expression of predicates. System R is an
experimental database management system
developed to carry out research on the rela-
tional model of data. System R was designed
and built by members of the IBM San Jose
Research Laboratory.
1. Introduction
System R is an experimental database man-
agement system based on the relational
model of data which has been under develop-
ment at the IBM San Jose Research Laboratory
since 1975 <1>. The software was developed
as a research vehicle in relational data-
base, and is not generally available out-
side the IBM Research Division.
This paper assumes familiarity with rela-
tional data model terminology as described
in Codd <7> and Date <8>. The user interface
in System R is the unified query, data def-
inition, and manipulation language SQL <5>.
Statements in SQL can be issued both from an
on-line casual-user-oriented terminal
interface and from programming languages
such as PL/I and COBOL.
In System R a user need not know how the
tuples are physically stored and what
access paths are available (e.g. which col-
umns have indexes). SQL statements do not
require the user to specify anything about
the access path to be used for tuple
retrieval. Nor does a user specify in what
order joins are to be performed. The System
R optimizer chooses both join order and an
access path for each table in the SQL state-
ment. Of the many possible choices, the
optimizer chooses the one which minimizes
“total access cost” for performing the
entire statement.
This paper will address the issues of
access path selection for queries.
Retrieval for data manipulation (UPDATE,
DELETE) is treated similarly. Section 2
will describe the place of the optimizer in
the processing of a SQL statement, and sec-
tion 3 will describe the storage component
access paths that are available on a single
physically stored table. In section 4 the
optimizer cost formulas are introduced for
single table queries, and section 5 dis-
cusses the joining of two or more tables,
and their corresponding costs. Nested que-
ries (queries in predicates) are covered in
section 6.
2. Processing of an SQL statement
A SQL statement is subjected to four
phases of processing. Depending on the ori-
gin and contents of the statement, these
phases may be separated by arbitrary inter-
vals of time. In System R these arbitrary
time intervals are transparent to the sys-
tem components which process a SQL state-
ment. These mechanisms and a description of
the processing of SQL statements from both
programs and terminals are further dis-
cussed in <2>. Only an overview of those
processing steps that are relevant to
access path selection will be discussed
here.
The four phases of statement processing
are parsing, optimization, code generation,
and execution. Each SQL statement is sent to
the parser, where it is checked for correct
syntax. A query block
is represented by a
SELECT list, a FROM list, and a WHERE tree,
containing, respectively the list of items
to be retrieved, the table(s) referenced,
and the boolean combination of simple pred-
icates specified by the user. A single SQL
statement may have many query blocks
because a predicate may have one operand
which is itself a query.
If the parser returns without any errors
detected, the OPTIMIZER component is
called. The OPTIMIZER accumulates the names
Copyright © 1979 by the ACM, Inc., used by permission. Permis-
sion to make digital or hard copies is granted provided that copies
are not made or distributed for profit or direct commercial advan-
tage, and that copies show this notice on the first page or initial
screen of a display along with the full citation.
Originally published in the Proceedings of the 1979 ACM SIGMOD
International Conference on the Management of Data.
Digital recreation by Eric A. Brewer, brewer@cs.berke-
ley.edu, October 2002.
24
of tables and columns referenced in the
query and looks them up in the System R cat-
alogs to verify their existence and to
retrieve information about them.
The catalog lookup portion of the OPTI-
MIZER also obtains statistics about the
referenced relations, and the access paths
available on each of them. These will be
used later in access path selection. After
catalog lookup has obtained the datatype
and length of each column, the OPTIMIZER
rescans the SELECT-list and WHERE-tree to
check for semantic errors and type compati-
bility in both expressions and predicate
comparisons.
Finally the OPTIMIZER performs access
path selection. It first determines the
evaluation order among the query blocks in
the statement. Then for each query block,
the relations in the FROM list are pro-
cessed. If there is more than one relation
in a block, permutations of the join order
and of the method of joining are evaluated.
The access paths that minimize total cost
for the block are chosen from a tree of
alternate path choices. This minimum cost
solution is represented by a structural
modification of the parse tree. The result
is an execution plan in the Access Specifi-
cation Language (ASL) <10>.
After a plan is chosen for each query
block and represented in the parse tree, the
CODE GENERATOR is called. The CODE GENERA-
TOR is a table-driven program which trans-
lates ASL trees into machine language code
to execute the plan chosen by the OPTIMIZER.
In doing this it uses a relatively small
number of code templates, one for each type
of join method (including no join). Query
blocks for nested queries are treated as
“subroutines” which return values to the
predicates in which they occur. The CODE
GENERATOR is further described in <9>.
During code generation, the parse tree is
replaced by executable machine code and its
associated data structures. Either control
is immediately transferred to this code or
the code is stored away in the database for
later execution, depending on the origin of
the statement (program or terminal). In
either case, when the code is ultimately
executed, it calls upon the System R inter-
nal storage system (RSS) via the storage
system interface (RSI) to scan each of the
physically stored relations in the query.
These scans are along the access paths cho-
sen by the OPTIMIZER. The RSI commands that
may be used by generated code are described
in the next section.
3. The Research Storage System
The Research Storage System (RSS) is the
storage subsystem of System R. It is respon-
sible for maintaining physical storage of
relations, access paths on these relations,
locking (in a multi-user environment), and
logging and recovery facilities. The RSS
presents a tuple-oriented interface (RSI)
to its users. Although the RSS may be used
independently of System R, we are concerned
here with its use for executing the code
generated by the processing of SQL state-
ments in System R, as described in the pre-
vious section. For a complete description
of the RSS, see <1>.
Relations are stored in the RSS as a col-
lection of tuples whose columns are physi-
cally contiguous. These tuples are stored
on 4K byte pages; no tuple spans a page.
Pages are organized into logical units
called segments. Segments may contain one
or more relations, but no relation may span
a segment. Tuples from two or more relations
may occur on the same page. Each tuple is
tagged with the identification of the rela-
tion to which it belongs.
The primary way of accessing tuples in a
relation is via an RSS scan. A scan returns
a tuple at a time along a given access path.
OPEN, NEXT, and CLOSE are the principal com-
mands on a scan.
Two types of scans are currently avail-
able for SQL statements. The first type is a
segment scan to find all the tuples of a
given relation. A series of NEXTs on a seg-
ment scan simply examines all pages of the
segment which contain tuples, from any
relation, and returns those tuples belong-
ing to the given relation.
The second type of scan is an index scan.
An index may be created by a System R user
on one or more columns of a relation, and a
relation may have any number (including
zero) of indexes on it. These indexes are
stored on separate pages from those con-
taining the relation tuples. Indexes are
implemented as B-trees <3>, whose leaves
are pages containing sets of (key, identi-
fiers of tuples which contain that key).
Therefore a series of NEXTs on an index scan
does a sequential read along the leaf pages
of the index, obtaining the tuple identifi-
ers matching a key, and using them to find
and return the data tuples to the user in
key value order. Index leaf pages are
chained together so that NEXTs need not ref-
erence any upper level pages of the index.
In a segment scan, all the non-empty
pages of a segment will be touched, regard-
less of whether there are any tuples from
the desired relation on them. However, each
page is touched only once. When an entire
relation is examined via an index scan, each
page of the index is touched only once, but
a data page may be examined more than once
if it has two tuples on it which are not
“close” in the index ordering. If the tuples
are inserted into segment pages in the index
ordering, and if this physical proximity
corresponding to index key value is main-
tained, we say that the index is clustered
.
A clustered index has the property that not
only each index page, but also each data
page containing a tuple from that relation
25
will be touched only once in a scan on that
index.
An index scan need not scan the entire
relation. Starting and stopping key values
may be specified in order to scan only those
tuples which have a key in a range of index
values. Both index and segment scans may
optionally take a set of predicates, called
search arguments (or SARGS), which are
applied to a tuple before it is returned to
the RSI caller. If the tuple satisfies the
predicates, it is returned; otherwise the
scan continues until it either finds a tuple
which satisfies the SARGS or exhausts the
segment or the specified index value range.
This reduces cost by eliminating the over-
head of making RSI calls for tuples which
can be efficiently rejected within the RSS.
Not all predicates are of the form that can
become SARGS. A sargable predicate
is one of
the form (or which can be put into the form)
“column comparison-operator value”. SARGS
are expressed as a boolean expression of
such predicates in disjunctive normal form.
4. Costs for single relation access paths
In the next several sections we will
describe the process of choosing a plan for
evaluating a query. We will first describe
the simplest case, accessing a single rela-
tion, and show how it extends and general-
izes to 2-way joins of relations, n-way
joins, and finally multiple query blocks
(nested queries).
The OPTIMIZER examines both the predi-
cates in the query and the access paths
available on the relations referenced by
the query, and formu1ates a cost prediction
for each access plan, using the following
cost formula:
COST = PAGE FETCHES + W * (RSI CALLS).
This cost is a weighted measure of I/O
(pages fetched) and CPU utilization
(instructions executed). W is an adjustable
weighting factor between I/O and CPU. RSI
CALLS is the predicted number of tuples
returned from the RSS. Since most of System
R’s CPU time is spent in the RSS, the number
of RSI calls is a good approximation for CPU
utilization. Thus the choice of a minimum
cost path to process a query attempts to
minimize total resources required.
During execution of the type-compatibil-
ity and semantic checking portion of the
OPTIMIZER, each query block’s WHERE tree of
predicates is examined. The WHERE tree is
considered to be in conjunctive normal
form, and every conjunct is called a boolean
factor. Boolean factors are notable because
every tuple returned to the user must sat-
isfy every boolean factor. An index is said
to match a boolean factor if the boolean
factor is a sargable predicate whose refer-
enced column is the index key; e.g., an
index on SALARY matches the predicate ‘SAL-
ARY = 20000’. More precisely, we say that a
predicate or set of predicates matches
an
index access path when the predicates are
sargable and the columns mentioned in the
predicate(s) are an initial substring of
the set of columns of the index key. For
example, a NAME, LOCATION index matches
NAME = 'SMITH' AND LOCATION = 'SAN JOSE'. If
an index matches a boolean factor, an access
using that index is an efficient way to sat-
isfy the boolean factor. Sargable boolean
factors can also be efficiently satisfied
if they are expressed as search arguments.
Note that a boolean factor may be an entire
tree of predicates headed by an OR.
During catalog lookup, the OPTIMIZER
retrieves statistics on the relations in
the query and on the access paths available
on each relation. The statistics kept are
the following:
For each relation T,
- NCARD(T), the cardinality of relation T.
- TCARD(T), the number of pages in the seg-
ment that hold tuples of relation T.
- P(T), the fraction of data pages in the
segment that hold tuples of relation T.
P(T) = TCARD(T) / (no. of non-empty
pages in the segment).
For each index I on relation T,
- ICARD(I),
number of distinct keys in index
I.
- NINDX(I), the number of pages in index I.
These statistics are maintained in the
System R catalogs, and come from several
sources. Initial relation loading and index
creation initialize these statistics. They
are then updated periodically by an UPDATE
STATISTICS command, which can be run by any
user. System R does not update these statis-
tics at every INSERT, DELETE, or UPDATE
because of the extra database operations
and the locking bottleneck this would cre-
ate at the system catalogs. Dynamic updat-
ing of statistics would tend to serialize
accesses that modify the relation contents.
Using these statistics, the OPTIMIZER
assigns a selectivity factor
'F' for each
boolean factor in the predicate list. This
selectivity factor very roughly corresponds
to the expected fraction of tuples which
will satisfy the predicate. TABLE 1 gives
the selectivity factors for different kinds
of predicates. We assume that a lack of sta-
tistics implies that the relation is small,
so an arbitrary factor is chosen.
Query cardinality (QCARD) is the product
of the cardinalities of every relation in
the query block’s FROM list times the prod-
uct of all the selectivity factors of that
剩余11页未读,继续阅读
资源评论
Quantum_bit
- 粉丝: 2
- 资源: 39
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- 《CKA/CKAD应试指南/从docker到kubernetes 完全攻略》学习笔记 第1章docker基础(1.1-1.4)
- 基于python实现的水下压缩空气储能互补系统建模仿真与经济效益分析+源代码+论文
- 华中科技大学-自然语言处理实验,Bi-LSTM+CRF的中文分词框架,并且利用基于深度学习的方法进行中文命名实体识别++源码报告
- 基于动态罚函数的铁路车流分配与径路优化模型python源码
- 鱼群算法求解组环问题python源码+文档说明
- 基于决策优化的多波束测深测线规划模型MATLAB代码
- 课程设计-基于python实现的多目标优化算法求解带时间窗的车辆路径规划问题+源代码+文档说明+界面截图+pptx
- 基于通信信号与通信系统的MATLAB仿真源码-课程设计
- 嵌入式-信号机制(概念,发送,定时,捕捉,SIGCHLD 信号实现回收子进程)
- c语言管理系统大一大二笔记
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功