<<深入搜索引擎>>源码资源-CSDN文库

深入搜索引擎

搜索引擎

5星 · 超过95%的资源需积分: 11 51 浏览量 2009-10-25 13:19:17 上传评论收藏 40KB GZ 举报

共23个文件

c：9个

1：5个

h：5个

《深入搜索引擎》与《Managing Gigabytes》是两本在信息技术领域具有深远影响力的著作，它们深入探讨了搜索引擎的设计原理和技术实现。这些书籍不仅为初学者提供了理解搜索引擎基础的平台，也为经验丰富的开发者提供了优化和改进现有搜索系统的宝贵见解。源码是书中理论知识的实践体现，通过对这些源码的学习，我们可以更直观地理解搜索引擎的工作机制。让我们来看看压缩包中包含的文件名：arith_coder.1、uint.1、word.1、char.1、bits.1、arith.c、stats.c、main.c、hashtable.c、word.c。这些文件名暗示了源码涉及的关键技术模块。 1. **arith_coder**：这通常涉及到一种叫做算术编码的数据压缩技术。算术编码是一种无损数据压缩方法，它通过概率模型来分配每个符号的编码长度，有效地减少了表示数据所需的位数。在搜索引擎中，算术编码可以用于索引构建过程中的词项编码，以节省存储空间。 2. **uint.1**、**word.1**、**char.1**、**bits.1**：这些文件可能包含了关于无符号整数（uint）、单词（word）、字符（char）和位操作（bits）的基础定义和操作函数。在搜索引擎中，这些基本数据类型和操作广泛应用于文本处理，例如分词、编码、排序和比较。 3. **arith.c**、**stats.c**：这两个文件很可能实现了算术编码的具体算法和统计功能。算术编码需要根据符号的概率分布进行计算，而统计则用于获取这些分布信息，比如在索引构建阶段对文档中词频的统计。 4. **main.c**：这是程序的主要入口点，包含了整个搜索引擎系统的核心逻辑。从这里，我们可以看到如何调用其他模块，如何初始化，以及如何执行搜索请求。 5. **hashtable.c**：哈希表是一种高效的数据结构，用于存储和查找数据。在搜索引擎中，哈希表常用于构建倒排索引，快速定位到包含特定词项的文档。 6. **word.c**：这个文件可能包含了处理单词（词项）的相关操作，如分词、词干提取、词形还原等，这些都是搜索引擎预处理文本的关键步骤。通过分析这些源码，我们可以学习到搜索引擎是如何进行文本预处理、建立倒排索引、执行搜索查询、以及进行结果排序的。此外，还会涉及数据压缩、内存管理和性能优化等方面的技术。深入理解这些内容，将有助于我们设计和实现更高效、更智能的搜索引擎。

资源推荐

资源详情

资源评论

收起资源包目录

arith_coder-3.tar.gz （23个子文件）

hashtable.c 10KB

main.h 2KB

main.c 18KB

WHATSNEW 8KB

arith.h 5KB

uint.1 21B

stats.h 4KB

arith_coder.1 6KB

word.1 21B

hashtable.h 3KB

arith.c 26KB

bits.c 6KB

char.c 3KB

stats.c 24KB

unroll.i 2KB

word.c 10KB

bitio.h 5KB

README 8KB

uint.c 5KB

bitio.c 3KB

char.1 21B

bits.1 21B

Makefile 2KB

/****************************************************************************** File: arith.c Authors: John Carpinelli (johnfc@ecr.mu.oz.au) Wayne Salamonsen (wbs@mundil.cs.mu.oz.au) Lang Stuiver (langs@cs.mu.oz.au) Radford Neal (radford@ai.toronto.edu) Purpose: Data compression using a revised arithmetic coding method. Based on: A. Moffat, R. Neal, I.H. Witten, "Arithmetic Coding Revisted", Proc. IEEE Data Compression Conference, Snowbird, Utah, March 1995. Low-Precision Arithmetic Coding Implementation by Radford M. Neal Copyright 1995 John Carpinelli and Wayne Salamonsen, All Rights Reserved. Copyright 1996 Lang Stuiver, All Rights Reserved. These programs are supplied free of charge for research purposes only, and may not sold or incorporated into any commercial product. There is ABSOLUTELY NO WARRANTY of any sort, nor any undertaking that they are fit for ANY PURPOSE WHATSOEVER. Use them at your own risk. If you do happen to find a bug, or have modifications to suggest, please report the same to Alistair Moffat, alistair@cs.mu.oz.au. The copyright notice above and this statement of conditions must remain an integral part of each and every copy made of these files. $Log: arith.c,v $ Revision 1.1 1996/08/07 01:34:11 langs Initial revision ************************************************************************ The following arithmetic coding routines contain two different methods for doing the required arithmetic. The method must be specified at compile time. The first method uses multiplication and division, the second simulates it with shifts and adds. The other decision is whether to specify the number of frequency and code bits (F_BITS and B_BITS, respectively) at run time or compile time. Fixing them at compile time gives improved performance, especially for the shift/add version. This is specified with the following flags (see the Makefile): 1. Multiplication and divison. (Fixed) -DMULT_DIV -DB_BITS=32 -DF_BITS=27 2. Multiplication and devision. (Variable) -DMULT_DIV -DVARY_NBITS 3. Shift/add arithmetic. (Fixed) -DB_BITS=32 -DF_BITS=27 4. Shift/add arithmetic. (Variable) -DVARY_NBITS FIXING BITS: Usually frequency bits and code bits would only be changed for testing, and in a standard implementation might well be fixed at compile time. The advantage here is that the shift/add loop can be unrolled, without need for testing and decrementing a counter variable. The loops are unrolled explicitly by calling an include file unroll.i. The compiler may well be able to unroll the loops automatically when they are written to be detectable (with -funroll-loops in gcc, for example). Our aim here was to emphasise what is going on at the machine level, and to ensure similar performance was obtained independent of the C compiler used. INPUT / OUTPUT: Version 2.0. The input and output streams are now separate, with variables relating to the streams prefixed "in_" and "out_" respectively. This means that decoding and encoding can be performed simultaneously. DIFFERENCE FROM CONVENTIONAL ARITHMETIC CODING: Approximate arithmetic allows a greater frequency range, at expense of compression size. The inefficiency is bounded by the maximum frequency and when B_bits=32 and F_bits=27, for example, this inefficency is negligible. If the shift/add optimisations are used, the total frequency count, t, must be partially normalised, so that 2^{f-1} < t <= 2^f, where f is frequency bits, the variable "F_bits". This partial normalisation is done by stats.c, and is the onus of the calling program if the coding routines are called directly. (This only applies to shift/add arithmetic; the original multiplication/ division arithmetic will still work correctly with arbitrary values for t). FRUGAL_BITS option: See comments in arith.h. ******************************************************************************/ #include <stdio.h> #include "arith.h" #include "bitio.h" #ifdef RCSID static char rcsid[] = "$Id: arith.c,v 1.1 1996/08/07 01:34:11 langs Exp $"; #endif /* Define coder_desc[] string, which describes the coder */ #define MULT_STR "Arithmetic coding (multiply with " #define SHIFT_STR "Arithmetic coding (Shift add with " #define VARY_STR "variable bits)" #define FIXED_STR "fixed bits)" #ifdef MULT_DIV # ifdef VARY_NBITS char *coder_desc = MULT_STR VARY_STR; # else char *coder_desc = MULT_STR FIXED_STR; # endif #else /* Shift add instead of multiply */ # ifdef VARY_NBITS char *coder_desc = SHIFT_STR VARY_STR; # else char *coder_desc = SHIFT_STR FIXED_STR; # endif #endif #if defined(VARY_NBITS) int B_bits = B_BITS; /* Default values */ int F_bits = F_BITS; static code_value Half; static code_value Quarter; #else # define Half ((code_value) 1 << (B_bits-1)) # define Quarter ((code_value) 1 << (B_bits-2)) #endif /* Separate input and output */ /* Input decoding state */ static code_value in_R; /* code range */ static code_value in_D; /* = V-L (V offset)*/ static div_value in_r; /* normalized range */ #ifdef FRUGAL_BITS static code_value in_V; /* Bitstream window */ #endif /* Output encoding state */ static code_value out_L; /* lower bound */ static code_value out_R; /* code range */ static unsigned long out_bits_outstanding; /* follow bit count */ /* * BIT_PLUS_FOLLOW(b) * responsible for outputting the bit passed to it and an opposite number of * bit equal to the value stored in bits_outstanding * */ #define ORIG_BIT_PLUS_FOLLOW(b) \ do \ { \ OUTPUT_BIT((b)); \ while (out_bits_outstanding > 0) \ { \ OUTPUT_BIT(!(b)); \ out_bits_outstanding--; \ } \ } while (0) #ifdef FRUGAL_BITS int _ignore_first_bit = 1; # define BIT_PLUS_FOLLOW(x) \ do \ { \ if (_ignore_first_bit) \ _ignore_first_bit = 0; \ else \ ORIG_BIT_PLUS_FOLLOW(x); \ } while (0) #else # define BIT_PLUS_FOLLOW(x) ORIG_BIT_PLUS_FOLLOW(x) #endif /* * ENCODE_RENORMALISE * output code bits until the range has been expanded * to above QUARTER * With FRUGAL_BITS option, ignore first zero bit output * (a redundant zero will otherwise be emitted every time the encoder is * started) */ #define ENCODE_RENORMALISE \ do { \ while (out_R <= Quarter) \ { \ if (out_L >= Half) \ { \ BIT_PLUS_FOLLOW(1); \ out_L -= Half; \ } \ else if (out_L+out_R <= Half) \ { \ BIT_PLUS_FOLLOW(0); \ } \ else \ { \ out_bits_outstanding++; \ out_L -= Quarter; \ } \ out_L <<= 1; \ out_R <<= 1; \ } \ } while (0) /* * DECODE_RENORMALISE * input code bits until range has been expanded to * more than QUARTER. Mimics encoder. * FRUGAL_BITS option also keeps track of bitstream input so it can work out * exactly how many disambiguating bits the encoder put out (1,2 or 3). */ #ifdef FRUGAL_BITS #define DECODE_RENORMALISE \ do { \ while (in_R <= Quarter) \ { \ in_R <<= 1; \ in_V <<= 1; \ ADD_NEXT_INPUT_BIT(in_D,B_bits); \ if (in_D & 1) \ in_V++; \ } \ } while (0) #else #define DECODE_RENORMALISE \ do { \ while (in_R <= Quarter) \ { \ in_R <<= 1; \ ADD_NEXT_INPUT_BIT(in_D,0); \ } \ } while (0) #endif /* * * encode a symbol given its low, high and total frequencies * */ void arithmetic_encode(freq_value low, freq_value high, freq_value total) { /* The following pseudocode is a concise (but slow due to arithmetic * cal

评论收藏

内容反馈