CHAPTER 1: INTRODUCTION TO INFORMATION STORAGE AND RETRIEVAL
SYSTEMS.............................................................................................................................. 9
1.1 INTRODUCTION............................................................................................................ 9
1.2 A DOMAIN ANALYSIS OF IR SYSTEMS.................................................................10
1.2.1 Conceptual Models of IR ......................................................................................... 11
1.2.2 File Structures.......................................................................................................... 12
1.2.3 Query Operations.....................................................................................................13
1.2.4 Term Operations...................................................................................................... 13
1.2.5 Document Operations .............................................................................................. 14
1.2.6 Hardware for IR.......................................................................................................14
1.2.7 Functional View of Paradigm IR System ................................................................ 15
1.3 IR AND OTHER TYPES OF INFORMATION SYSTEMS.........................................17
1.4 IR SYSTEM EVALUATION......................................................................................... 18
1.5 SUMMARY....................................................................................................................20
REFERENCES ..................................................................................................................... 21
CHAPTER 2: INTRODUCTION TO DATA STRUCTURES AND ALGORITHMS
RELATED TO INFORMATION RETRIEVAL.................................................................. 22
2.1 INTRODUCTION.......................................................................................................... 22
2.2 BASIC CONCEPTS.......................................................................................................22
2.2.1 Strings ......................................................................................................................23
2.2.2 Similarity between Strings....................................................................................... 23
2.2.3 Regular Expressions ................................................................................................ 23
2.2.4 Finite Automata.......................................................................................................25
2.3 DATA STRUCTURES ...................................................................................................27
2.3.1 Search Trees.............................................................................................................27
2.3.2 Hashing....................................................................................................................29
2.3.3 Digital Trees ............................................................................................................ 30
2.4 ALGORITHMS .............................................................................................................. 32
2.4.1 Retrieval Algorithms ...............................................................................................33
2.4.2 Filtering Algorithms ................................................................................................ 34
2.4.3 Indexing Algorithms................................................................................................ 34
REFERENCES ..................................................................................................................... 35
CHAPTER 3: INVERTED FILES ....................................................................................... 38
3.1 INTRODUCTION.......................................................................................................... 38
3.2 STRUCTURES USED IN INVERTED FILES .............................................................40
3.2.1 The Sorted Array .....................................................................................................40
3.2.2 B-trees......................................................................................................................40
3.2.3 Tries ......................................................................................................................... 41
3.3 BUILDING AN INVERTED FILE USING A SORTED ARRAY...............................41
3.4 MODIFICATIONS TO THE BASIC TECHNIQUE ..................................................... 44
3.4.1 Producing an Inverted File for Large Data Sets without Sorting ............................ 44
3.4.2 A Fast Inversion Algorithm..................................................................................... 47
3.5 SUMMARY....................................................................................................................52
REFERENCES ..................................................................................................................... 52
CHAPTER 4: SIGNATURE FILES..................................................................................... 54
4.1 INTRODUCTION.......................................................................................................... 54
4.2 BASIC CONCEPTS.......................................................................................................56
4.3 COMPRESSION ............................................................................................................ 59
4.3.1 Bit-block Compression (BC)................................................................................... 59
4.3.2 Variable Bit-block Compression (VBC).................................................................60
4.3.3 Performance.............................................................................................................62
4.4 VERTICAL PARTITIONING ....................................................................................... 62
4.4.1 Bit-Sliced Signature Files (BSSF)...........................................................................63
4.4.2 B'SSF, a Faster Version of BSSF............................................................................ 64
4.4.3 Frame-Sliced Signature File....................................................................................64
4.4.4 The Generalized Frame-Sliced Signature File (GFSSF).........................................65
4.4.5 Performance.............................................................................................................66
4.5 VERTICAL PARTITIONING AND COMPRESSION................................................. 67
4.5.1 Compressed Bit Slices (CBS)..................................................................................67
4.5.2 Doubly Compressed Bit Slices (DCBS).................................................................. 69
4.5.3 No False Drops Method (NFD)............................................................................... 70
4.5.4 Performance.............................................................................................................71
4.6 HORIZONTAL PARTITIONING ................................................................................. 72
4.6.1 Data Independent Case............................................................................................ 72
Gustafson's method.......................................................................................................72
Partitioned signature files .............................................................................................73
4.6.2 Data Dependent Case...............................................................................................73
Two-level signature files .............................................................................................. 73
S-tree.............................................................................................................................74
4.7 DISCUSSION................................................................................................................. 74
REFERENCES ..................................................................................................................... 75
CHAPTER 5: NEW INDICES FOR TEXT: PAT TREES AND PAT ARRAYS............... 79
5.1 INTRODUCTION.......................................................................................................... 79
5.2 THE PAT TREE STRUCTURE..................................................................................... 80
5.2.1 Semi-infinite Strings................................................................................................ 81
5.2.2 PAT Tree................................................................................................................. 81
5.2.3 Indexing Points ........................................................................................................ 83
5.3 ALGORITHMS ON THE PAT TREE...........................................................................83
5.3.1 Prefix Searching.......................................................................................................83
5.3.2 Proximity Searching ................................................................................................ 84
5.3.3 Range Searching...................................................................................................... 84
5.3.4 Longest Repetition Searching..................................................................................84
5.3.5 "Most Significant" or "Most Frequent" Searching ..................................................85
5.3.6 Regular Expression Searching................................................................................. 86
5.4 BUILDING PAT TREES AS PATRICIA TREES ........................................................87
5.5 PAT TREES REPRESENTED AS ARRAYS ...............................................................88
5.5.1 Searching PAT Trees as Arrays............................................................................... 89
5.5.2 Building PAT Trees as Arrays................................................................................. 89
Building PAT arrays in memory................................................................................... 90
Merging small against large PAT arrays ...................................................................... 90
5.5.3 Delayed Reading Paradigm ..................................................................................... 92
5.5.4 Merging Large Files ................................................................................................ 93
5.6 SUMMARY....................................................................................................................94
REFERENCES ..................................................................................................................... 95