Managing Gigabytes: Compressing and Indexing Documents and Images

    In this fully updated second edition of the highly acclaimed Managing Gigabytes, authors Witten, Moffat, and Bell continue to provide unparalleled coverage of state-of-the-art techniques for compressing and indexing data. Whatever your field, if you work with large quantities of information, this book is essential reading--an authoritative theoretical resource and a practical guide to meeting the toughest storage and access challenges. It covers the latest developments in compression and indexing and their application on the Web and in digital libraries. It also details dozens of powerful techniques supported by mg, the authors' own system for compressing, storing, and retrieving text, images, and textual images. mg's source code is freely available on the Web.

* Up-to-date coverage of new text compression algorithms such as block sorting, approximate arithmetic coding, and fast Huffman coding
* New sections on content-based index compression and distributed querying, with two new data structures for fast indexing
* New coverage of image coding, including descriptions of the de facto standards in use on the Web (GIF and PNG), and information on CALIC, the proposed new JPEG lossless standard, and JBIG2
* New information on the Internet and WWW, digital libraries, web search engines, and agent-based retrieval
* Accompanied by a public-domain system called MG, a fully worked-out operational example of the advanced techniques developed and explained in the book
* New appendix on an existing digital library system that uses the MG software

Editorial Reviews

Amazon.com Review
Of all the tasks programmers are asked to perform, storing, compressing, and retrieving information are some of the most challenging--and critical to many applications. Managing Gigabytes: Compressing and Indexing Documents and Images is a treasure trove of theory, practical illustration, and general discussion of this fascinating technical subject. Ian Witten, Alistair Moffat, and Timothy Bell have updated their original work with this even more impressive second edition. This version adds recent techniques such as block sorting, new indexing techniques, new lossless compression strategies, and many other elements to the mix.

In short, this work is a comprehensive summary of text and image compression, indexing, and querying techniques. The history of relevant algorithm development is woven well with a practical discussion of challenges, pitfalls, and specific solutions. The title is a textbook-style exposition, with its material organized clearly into topics such as compression, indexing, and so forth. In addition to diagrams and example text transformations, the authors use pseudocode to present algorithms in a language-independent manner wherever possible. They also supplement the reading with mg, their own implementation of the techniques; the mg C source code is freely available on the Web.

Alone, this book is an impressive collection of information. Nevertheless, the authors list numerous titles for further reading on selected topics. Whether you're in the midst of application development and need solutions fast, or are merely curious about how top-notch information management is done, this hardcover is an excellent investment. --Stephen W. Plain

Topics covered: Text compression models, including Huffman, LZW, and their variants; trends in information management; index creation and compression; image compression; performance issues; and overall system implementation.
Review

"This book is the Bible for anyone who needs to manage large data collections. It's required reading for our search gurus at Infoseek. The authors have done an outstanding job of incorporating and describing the most significant new research in information retrieval over the past five years into this second edition." --Steve Kirsch, Cofounder, Infoseek Corporation

"The new edition of Witten, Moffat, and Bell not only has newer and better text search algorithms but much material on image analysis and joint image/text processing. If you care about search engines, you need this book: it is the only one with full details of how they work. The book is both detailed and enjoyable; the authors have combined elegant writing with top-grade programming." --Michael Lesk, National Science Foundation

"The coverage of compression, file organizations, and indexing techniques for full text and document management systems is unsurpassed. Students, researchers, and practitioners will all benefit from reading this book." --Bruce Croft, Director, Center for Intelligent Information Retrieval at the University of Massachusetts

About the Author

Ian H. Witten is a professor of computer science at the University of Waikato in New Zealand, where he directs the New Zealand Digital Library research project. His research interests include information retrieval, machine learning, text compression, and programming by demonstration. He received an MA in Mathematics from Cambridge University, England; an MSc in Computer Science from the University of Calgary, Canada; and a PhD in Electrical Engineering from Essex University, England.
He is a fellow of the ACM and of the Royal Society of New Zealand. He has published widely on digital libraries, machine learning, text compression, hypertext, speech synthesis and signal processing, and computer typography. He has written several books, the latest being Managing Gigabytes (1999) and Data Mining (2000), both from Morgan Kaufmann.

This book is one of the preferred texts for the information retrieval and mining course at Stanford University and has become a principal information retrieval textbook at major universities worldwide. Giving equal weight to theory and practice, it presents a complete and accessible solution for managing massive amounts of information, covering every aspect of compression, indexing, and querying. Its greatest strength is that it goes beyond what is needed for theoretical study of information retrieval: more importantly, it addresses the many problems that arise in practice and shows how to solve them. As one of the texts for Stanford's information retrieval course it is a demanding read, aimed mainly at senior undergraduate and graduate students in information retrieval, engineers in the search engine industry, and technical staff working on large-scale data processing.

Contents

Chapter 1 Overview 1
  1.1 Document databases 7
  1.2 Compression 10
  1.3 Indexes 12
  1.4 Document indexes 16
  1.5 The MG document management system 20
  1.6 Further reading 21
Chapter 2 Text compression 23
  2.1 Models 26
  2.2 Adaptive models 29
  2.3 Huffman coding 32
    Canonical Huffman codes 38
    Calculating Huffman code lengths 44
    Summary 51
  2.4 Arithmetic coding 51
    How arithmetic coding works 53
    Implementing arithmetic coding 56
    Storing cumulative counts 59
  2.5 Symbolwise models 61
    Prediction by partial matching 61
    Block-sorting compression 64
    Dynamic Markov compression 69
    Word-based compression 71
  2.6 Dictionary models 73
    The LZ77 family of adaptive dictionary coders 74
    The gzip variant of LZ77 78
    The LZ78 family of adaptive dictionary coders 79
    The LZW variant of LZ78 81
  2.7 Synchronization 84
    Creating synchronization points 84
    Self-synchronizing codes 87
  2.8 Performance comparison 89
    Compression performance 91
    Compression speed 94
    Other performance considerations 97
  2.9 Further reading 98
Chapter 3 Indexing 102
  3.1 Sample document collections 106
  3.2 Inverted file indexing 110
  3.3 Inverted file compression 115
    Nonparameterized models 117
    Global Bernoulli model 120
    Global observed frequency model 123
    Local Bernoulli model 124
    Skewed Bernoulli model 125
    Local hyperbolic model 127
    Local observed frequency model 128
    Context-sensitive compression 130
  3.4 Performance of index compression methods 133
  3.5 Signature files and bitmaps 134
    Signature files 135
    Bit-sliced signature files 139
    Analysis of signature files 144
    Bitmaps 147
    Compression of signature files and bitmaps 148
  3.6 Comparison of indexing methods 151
  3.7 Case folding, stemming, and stop words 153
    Case folding 154
    Stemming 154
    Influence on index size 155
    Stop words 156
  3.8 Further reading 159
Chapter 4 Querying 162
  4.1 Accessing the lexicon 166
    Access structures 167
    Front coding 170
    Minimal perfect hash functions 173
    Design of perfect hash functions 176
    Disk-based lexicon storage 181
  4.2 Partially specified query terms 182
    Brute-force string matching 182
    Indexing with n-grams 183
    Rotated lexicons 184
  4.3 Boolean queries 186
    Conjunctive queries 187
    Term processing order 188
    Random access and fast lookup 189
    Blocked inverted indexes 192
    Nonconjunctive queries 194
  4.4 Information retrieval and ranking 195
    Coordinate matching 195
    Inner-product similarity 196
    Vector space models 202
  4.5 Evaluating retrieval effectiveness 205
    Recall and precision 205
    Recall-precision curves 207
    The TREC project 208
    World Wide Web searching 212
    Other effectiveness measures 215
  4.6 Implementing the cosine measure 216
    Within-document frequencies 217
    Computing the cosine values 220
    Memory for document weights 222
    Accumulator memory 227
    Fast query processing 228
    Frequency-sorted indexes 229
    Sorting 233
  4.7 Interactive retrieval 236
    Relevance feedback 237
    Probabilistic models 239
  4.8 Distributed retrieval 241
  4.9 Further reading 245
Chapter 5 Index construction 248
    Computational model 251
    Overview of index construction methods 252
  5.1 Memory-based inversion 253
  5.2 Sort-based inversion 256
  5.3 Index compression 261
    Compressing the temporary file 261
    Multiway merging 264
    In-place multiway merging 265
  5.4 Compressed in-memory inversion 271
    Large-memory inversion 271
    Lexicon-based partitioning 276
    Text-based partitioning 278
  5.5 Comparison of inversion methods 281
  5.6 Constructing signature files and bitmaps 282
  5.7 Dynamic collections 284
    Expanding the text 284
    Expanding the index 285
  5.8 Further reading 290
Chapter 6 Image compression 292
  6.1 Types of images 293
  6.2 The CCITT fax standard for bilevel images 297
  6.3 Context-based compression of bilevel images 301
    Context models 304
    Bilevel context models 307
    "Clairvoyant" compression 309
  6.4 JBIG: a standard for bilevel images 310
    Resolution reduction 311
    Templates and adaptive templates 316
    Coding and probability estimation 317
  6.5 Lossless compression of continuous-tone images 318
    The GIF and PNG lossless image formats 319
    FELICS: fast, efficient, lossless image compression system 321
    CALIC: context-based adaptive lossless image coding 325
    JPEG-LS: the new lossless image compression standard 326
  6.6 JPEG: a standard for continuous-tone images 328
  6.7 Progressive transmission of images 334
    Pyramid coding 335
    Compression of pyramid coding 335
    Median aggregation 337
    Error modeling 338
  6.8 Summary of image compression techniques 339
  6.9 Further reading 341
Chapter 7 Textual images 343
  7.1 The idea of textual image compression 345
  7.2 Lossy and lossless compression 349
  7.3 Extracting marks 351
    Tracing mark boundaries 351
    Removing marks from the image 354
    Sorting marks into natural reading order 356
  7.4 Template matching 357
    Global template matching 358
    Local template matching 360
    Compression-based template matching 361
    Screening library templates 364
    Evaluating template matching methods 365
  7.5 From marks to symbols 369
    Library construction 369
    Symbols and their offsets 371
  7.6 Coding the components of a textual image 372
    The library 372
    Symbol numbers 373
    Symbol offsets 373
    The original image 374
  7.7 Performance: lossy and lossless modes 376
  7.8 System considerations 381
  7.9 JBIG2: a standard for textual image compression 383
  7.10 Further reading 385
Chapter 8 Mixed text and images 386
  8.1 Orientation 388
    Detecting straight lines with the Hough transform 389
    Finding the left margin 391
    Projection profiles 392
    From slope histograms to text spectra 397
  8.2 Segmentation 401
    Bottom-up segmentation 401
    Top-down and combined segmentation 403
    Mark-based segmentation 404
    Segmentation using short text strings 406
    Segmentation using textual syntax 409
  8.3 Classification 410
  8.4 Further reading 413
Chapter 9 Implementation 415
  9.1 Text compression 416
    Choosing a compression model 417
    Choosing a coder 420
    Limitations of Huffman coding 422
    Length-limited coding 428
  9.2 Text compression performance 433
    Compression effectiveness 433
    Decompression speed 437
    Decompression memory 437
    Dynamic collections 440
  9.3 Images and textual images 442
    Compressing bilevel images 444
    Compressing grayscale images 445
    Compressing textual images 445
  9.4 Index construction 447
  9.5 Index compression 449
  9.6 Query processing 451
    Boolean queries 451
    Ranked queries 454
  9.7 Further reading 456
Chapter 10 The information explosion 458
  10.1 Two thousand years of information technology 458
  10.2 The Internet: a global information resource 460
  10.3 The paper problem 463
  10.4 Coping with the information explosion 465
    Web search engines 465
    Agent-based information retrieval 467
    Data mining 469
  10.5 Digital libraries 469
  10.6 Managing gigabytes better 471
  10.7 Small is beautiful 473
  10.8 Personal information support for life 475
  10.9 Further reading 476
Appendix A Guide to the MG system 478
  A.1 Installing the MG system 478
  A.2 A simple storage and retrieval example 480
  A.3 Database creation 485
  A.4 Querying an indexed document collection 489
  A.5 Nontextual files 491
  A.6 Image compression programs 493
Appendix B The New Zealand Digital Library 494
  B.1 What is the NZDL? 494
    Other document collections 497
    Growth of the collections 501
    Audio collections 502
    The melody index 503
  B.2 How the NZDL works 505
    Source documents 505
    Searching and indexing 506
  B.3 Impact 508
  B.4 Further reading 508
