没有合适的资源?快使用搜索试试~ 我知道了~
Data Structures for Text Sequences
4星 · 超过85%的资源 需积分: 10 23 下载量 41 浏览量
2009-03-30
22:56:21
上传
评论
收藏 330KB PDF 举报
温馨提示
试读
29页
Data Structures for Text Sequences 有关编写文本编辑器时读取文本序列的数据结构
资源推荐
资源详情
资源评论
Data Structures for Text Sequences
Charles Crowley
University of New Mexico
June 10, 1998
Abstract
The data structure used ot maintain the sequence of characters is an important part of a text
editor. This paper investigates and evaluates the range of possible data structures for text
sequences. The ADT interface to the text sequence comp onent of a text editor is examined.
Six common sequence data structures (array, gap, list, line p ointers, xed size buers and piece
tables) are examined and then a general model of sequence data structures that encompasses
all six structures is presented. The piece table metho d is explained in detail and its advantages
are presented. The design space of sequence data structures is examined and several variations
on the ones listed above are presented. These sequence data structures are compared exp eri-
mentally and evaluated based on a number of criteria. The exp erimental comparison is done by
implementing each data structure in an editing simulator and testing it using a synthetic load
of many thousands of edits. We also report on experiments on the senstivity of the results to
variations in the parameters used to generate the synthetic editing load.
1 Intro duction
The central data structure in a text editor is the one that manages the sequence of characters that
represents the current state of the le that is b eing edited. Every text editor requires such a data
structure but b o oks on data structures do not cover data structures for text sequences. Articles on
the design of text editors often discuss the data structure they use [1, 3, 6, 8 , 11, 12] but they do
not cover the area in a general way. This article is concerned with such data structures.
Figure 1 shows where sequence data structures t in with other data structures. Some ordered
sets are ordered by something intrinsic in the items in the sets (e.g., the value of an integer, the
lexicographic p osition of a string) and the p osition of an inserted item dep ends on its value and
the values of the items already in the set. Such data structures are mainly concerned with fast
searching. Data structures for this typ e of ordered set have b een studied extensively.
The other p ossibility is for the order to b e determined by where the items are placed when they are
inserted into the set. If insert and delete is restricted to the two ends of the ordering then you have
a deque. For deques, the two basic data structures are an array (used circularly) and a linked list.
Nothing b eyond this is necessary due to the simplicity of the ADT interface to deques. If you can
insert and delete items from anywhere in the ordering you have a sequence. An imp ortant sub class
Author's address: Computer Science Department, University of New Mexico, Albuquerque, New Mexico 87131,
oce: 505-277-5446, messages: 505-277-3112, fax: 505-277-0813, email: crowley@unmvax.cs.unm.edu
1
Linked
List
Array Gap Piece
tables
Ordered sets
(with inserts,
deletes and lookups)
Ordered by
where inserted
Insert/delete
at ends only
(Deque)
Unrestricted
insert/delete
(Sequence)
Ordered by
item attributes
Tree Heap Hash
table
. . .
Abstract
data type
Data
structure
Linked
List
Array
Fixed size
buffers
Line
spans
Figure 1: Ordered sets
is sequences where reading an item in the sequence (by p osition numb er) is extremely lo calized.
This is the case for text editors and it is this sub class that is examined in this pap er.
A linked list and an array are the two obvious data structures for a sequence. Neither is suitable
for a general purp ose text editor (a linked list takes up to o much memory and an array is to o slow
b ecause it requires to o much data movement) but they provide useful base cases on which to build
more complex sequence data structures. The gap metho d is a simple extension of an array, it is
simply an array with a gap in the middle where characters are inserted and deleted. Many text
editors use the gap metho d since it is simple and quite ecient but the demands on a mo dern text
editor (multiple les, very large les, structured text, sophisticated undo, virtual memory, etc.)
encourage the investigation of more complicated data structures which might handle these things
more eectively.
The more sophisticated sequence data structures keep the sequence as a recursive sequence of spans
of text. The line span metho d keeps each line together and keeps an array or a linked list of line
p ointers. The xed buer metho d keeps a linked list of xed size buers each of which is partially
full of text from the sequence. Both the line span metho d and the xed buer metho d have b een
used for many text editors.
A less commonly used metho d is the piece table metho d which keeps the text as a sequence of
\pieces" of text from either the original le and an \added text" le. This metho d has many
advantages and these will b ecome clear as the metho ds are presented in detail and analyzed. A
ma jor purp ose of this pap er is to describ e the piece table metho d and explain why it is a go o d data
structure for text sequences.
2
Lo oking at metho ds in detail suggests a general mo del of sequence data structures that subsumes
them all. Based on examination of this general mo del I will prop ose several new sequence data
structures that do not app ear to have b een tried b efore.
It is dicult to analyze these algorithms mathematically so an exp erimental approach was taken. I
have implemented a text editor simulator and a numb er of sequence data structures. Using this as
an exp erimental text b ed I have compared the p erformance of each of these sequence data structures
under a variety of conditions. Based on these exp eriments and other considerations I conclude with
recommendations on which sequence data structures are b est in various situations. In almost all
cases, either the gap metho d or the piece table metho d is the b est data structure.
2 Sequence Interfaces
It is useful to start out with a denition of a sequence and the interface to it. Since the more
sophisticated text sequence data structures are recursive in that they require a comp onent data
structure that maintains a sequence of p ointers, I will formulate the sequence of a general sequence
of \items" rather than as a sequence of characters. This supp orts discussion of the recursive
sequence data structures b etter.
2.1 The Sequence Abstract Data Typ e
A text editor maintains a sequence of characters, by implementing some variant of the abstract
date typ e Sequence. One denition of the Sequence abstract data typ e is:
Domains:
Item
| the data type that this is a sequence of (in most cases it will b e an ASCI I character).
Sequence
| an ordered set of
Item
s.
Position
| an index into the
Sequence
which identies the
Item
s (in this case it will b e
a natural numb er from 0 to the length of the
Sequence
minus one).
Syntax:
E mpty
:
!
Sequence
I nser t
:
Sequence
Position
Item
!
Sequence
D elete
:
Sequence
Position
!
Sequence
I temAt
:
Sequence
Position
!
Item
[ f
EndOfFile
g
Types:
s
:
Sequence
;
i
:
Item
;
3
p
,
p
1
,
p
2
:
Position
;
Axioms:
1.
D elete
(
E mpty ; p
) =
E mpty
2.
D elete
(
I nser t
(
s; p
1
; i
)
; p
2
) =
if
p
1
< p
2
then
I nser t
(
D elete
(
s; p
2
?
1)
; p
1
; i
)
if
p
1
=
p
2
then
s
if
p
1
> p
2
then
I nser t
(
D elete
(
s; p
2
)
; p
1
?
1
; i
)
3.
I temAt
(
E mpty ; p
) =
E ndO f F il e
4.
I temAt
(
I nser t
(
s; p
1
; i
)
; p
2
) =
if
p
1
< p
2
then
I temAt
(
s; p
2
?
1)
if
p
1
=
p
2
then
i
if
p
1
> p
2
then
I temAt
(
s; p
2
)
The denition of a
Sequence
is relatively simple. Axiom 1 says that deleting from an
E mpty
Sequence
is a no-op. This could b e considered an error. Axiom 2 allows the reduction of a
Sequence
of
I nser t
s and
D elete
s to a
Sequence
containing only
I nser t
s. This denes a canonical
form of a
Sequence
which is a
Sequence
of
I nser t
s on a initial
E mpty
Sequence
. Axiom 3
implies that reading outside the
Sequence
returns a sp ecial
E ndO f F il e
item. This also could
have b een an error. Axiom 4 denes the semantics of a
Sequence
by dening what is at each
p osition of a canonical
Sequence
.
1
2.2 The C Interface
How would this translate into C? First some typ e denitions:
typ edef
ReturnCode int
; /* 1 for success, zero or negative for failure */
typ edef
Position int
; /* a p osition in the sequence */
/* the rst item in the sequence is at p osition 0 */
typ edef
Item unsigned char
; /* they are sequences of eight bit bytes */
typ edef
struct
f
/* To b e determined */
/* Whatever information we need for the data structures we choose */
g
Sequence
;
In this interface the only op erations that change the Sequence are
Insert
and
Delete
.
1
I am ignoring the error of inserting b eyond the end of the existing sequence.
4
Sequence Empty( );
ReturnCode Insert( Sequence *sequence, Position position, Item ch );
ReturnCode Delete( Sequence *sequence, Position position );
Item ItemAt( Sequence *sequence, Position position );
| This do es not actually require a
p ointer to a Sequence since no change to the sequence is b eing made but we exp ect that they
will b e large structures and should not b e passing them around. I am ignoring error returns
(e.g., p osition out of range) for the purp oses of this discussion. These are easily added if
desired.
ReturnCode Close( Sequence *sequence );
Many variations are p ossible. The next few paragraphs discuss some of them.
Any practical interface would allow the sequence to b e initialized with the contexts of a le. In
theory this is just the
Empty
op eration followed by an
Insert
op eration for each character in the
initializing le. Of course, this is to o inecient for a real text editor.
2
Instead we would have a
NewSequence
op eration:
Sequence NewSequence( char * le
name );
| The sequence is initialized with the contents
of the le whose name is contained in `le
name'.
Usually the
Delete
op eration will delete any logically contiguous subsequence
ReturnCode Delete( Sequence *sequence, Position beginPosition, Position endPosition );
Sometimes the
Insert
op eration will insert a subsequence instead of just a single character.
ReturnCode Insert( Sequence *sequence, Position position, Sequence sequenceToInsert );
Sometimes
Copy
and
Move
are separate operations (instead of b eing comp osed of
Inserts
and
Deletes
).
ReturnCode Copy( Sequence *sequence, Position fromBegin, Position fromEnd, Position
toPosition );
ReturnCode Move( Sequence *sequence, Position fromBegin, Position fromEnd, Position
toPosition );
A
Replace
op eration that subsumes
Insert
and
Delete
in another p ossibility.
ReturnCode Replace( Sequence *sequence, Position fromBegin, Position fromEnd, Sequence
sequenceToReplaceItWith );
Finally the
ItemAt
pro cedure could retrieve a subsequence.
2
Although this is the method I use in my text editor simulator describ ed later.
5
剩余28页未读,继续阅读
资源评论
- dianba82014-06-23找到了我需要的内容,感觉还是挺有用的,点赞!
keminlau
- 粉丝: 411
- 资源: 32
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功