Fast Exact Search in Hamming Space with Multi-Index Hashing

Archive: ZIP, 7.32 MB, uploaded 2017-12-27 15:29:35. 35 files in total (14 `.h`, 11 `.cpp`, 5 `.m`).

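A Detailed Look at Fast Exact Search in Hamming Space with Multi-Index Hashing

In computer science and information technology, and especially in applications such as data mining, image processing, and pattern recognition, retrieving and matching large volumes of data efficiently is a critical task. "Fast Exact Search in Hamming Space with Multi-Index Hashing" describes a technique for fast, exact search over data in Hamming space. It uses multi-index hashing (MIH) to make search substantially faster than an exhaustive scan.

The Hamming space of dimension b is the set of all binary strings of length b. The distance between two strings, the Hamming distance, is the number of positions at which they differ. Hamming distance is widely used as a similarity measure in fields such as bioinformatics, cryptography, and coding theory. As a dataset grows, however, computing the Hamming distance between a query and every stored code becomes very expensive, which is why fast search algorithms matter.

Multi-index hashing is one strategy for this problem. Its core idea is to split each binary code into m disjoint substrings and to index each substring in its own hash table. If two codes differ in at most r bits, then by the pigeonhole principle at least one pair of corresponding substrings differs in at most ⌊r/m⌋ bits, so a query only has to probe a small neighborhood in each table to collect every possible match. The candidate set is a superset of the true neighbors, so the search is exact and no neighbor is missed. The design goal is to keep the number of candidates, and therefore the number of full distance comparisons, as small as possible.

The `mih-master` archive provided here contains a source-code implementation of the technique. Broadly, an implementation of this kind covers the following parts:

1. Data preprocessing: the input dataset is prepared and the index structures are built. This typically involves choosing how many indexes (hash tables) to use and how large each substring is.
2. Hash functions: a hashing technique such as LSH maps high-dimensional data to compact binary codes while preserving as much similarity information as possible. Within MIH, each of the m substrings of a code serves as the key into its own hash table.
3. Query processing: at search time, every index is probed to locate candidate matches quickly, and the candidates from all tables are merged to keep the number of comparisons low. The tables can be probed independently, although the README of this implementation still lists multi-core support as a TODO item.
4. Result verification: each candidate is checked by computing its full Hamming distance to the query. This step is necessary because a match on one substring does not guarantee that the whole code lies within the search radius.
5. Performance evaluation: experiments compare search speed under different parameter settings, with exhaustive linear scan as the baseline.

In practice, MIH suits scenarios such as nearest-neighbor search over binary codes (where the codes themselves approximate the original features) and similarity search in large image databases. A good understanding of the implementation details makes it possible to tune MIH further for large datasets.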
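The pigeonhole argument behind MIH can be checked on a small example. The following Python sketch is illustrative only and is not taken from the `mih-master` sources; the helper names `hamming` and `split_code` are hypothetical, and it assumes that m divides the code length b evenly.

```python
def hamming(a, b):
    """Number of bit positions at which the integers a and b differ."""
    return bin(a ^ b).count("1")


def split_code(code, b, m):
    """Split a b-bit code into m disjoint substrings of b // m bits each.

    Assumes m divides b; substring 0 holds the least significant bits.
    """
    s = b // m
    mask = (1 << s) - 1
    return [(code >> (j * s)) & mask for j in range(m)]


# Two 16-bit codes that differ in exactly 3 bits (bits 0, 4 and 10).
b, m = 16, 4
x = 0b1011_0010_1110_0001
y = x ^ 0b0000_0100_0001_0001

r = hamming(x, y)

# Distance of each pair of corresponding substrings.
parts = [hamming(p, q) for p, q in zip(split_code(x, b, m), split_code(y, b, m))]

# The substring distances add up to the full distance, so the smallest
# of them cannot exceed floor(r / m).
closest_substring = min(parts)
```

Here the three differing bits fall into three different substrings, yet the fourth substring still matches exactly, which is what lets a lookup in that substring's hash table retrieve `y` as a candidate for `x`.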


mih-master.zip (35 files)

mih-master

license.txt 936B

data

lsh_64_sift_1M.mat 7.3MB

test

test_bucket_group.cpp 1KB

hammingDist.m 1KB

test_mih_with_linscan.m 2KB

include

linscan.h 407B

result.h 288B

memusage.h 1KB

mihasher.h 2KB

myhdf5.h 74B

array32.h 817B

resulth5.h 477B

bitops.h 3KB

bitarray.h 846B

types.h 419B

reorder.h 303B

io.h 530B

bucket_group.h 650B

sparse_hashtable.h 877B

matlab

compactbit.m 642B

plot_time.m 4KB

RUN.sh 3KB

src

reorder.cpp 5KB

mihasher.cpp 7KB

array32.cpp 3KB

linscan.cpp 3KB

sparse_hashtable.cpp 2KB

bucket_group.cpp 5KB

create_lsh_codes.m 5KB

interface

saveRes.cpp 6KB

linscan_interface.cpp 5KB

loadVar.cpp 3KB

mih_interface.cpp 7KB

CMakeLists.txt 643B

README.md 5KB

Multi Index Hashing (MIH)
=======

An implementation of *"Fast Exact Search in Hamming Space with Multi-Index Hashing, M. Norouzi, A. Punjani, D. J. Fleet, IEEE TPAMI 2014"*. See http://www.cs.toronto.edu/~norouzi/research/mih/.

This algorithm performs fast exact nearest neighbor search in Hamming distance on binary codes. Using this code, one can re-run the experiments described in the paper. For best results, consider using *libhugetlbfs* with multi-index hashing.

### Compilation

You need make, cmake, the hdf5 library, and the hdf5-dev package to build this project.

To compile, create a folder called `build`, and run:

```
cd build
rm -rf *
cmake ..
make
```

This should generate two binary files: `mih` and `linscan`.

### Datasets

An example binary code dataset with 1 million 64-bit codes from SIFT is stored in the data folder.

To generate larger binary code datasets, download raw data and convert it to binary codes using a hashing technique (e.g., LSH or MLH). For example, download the INRIA bigann dataset (1 billion SIFT features) from http://corpus-texmex.irisa.fr/ and store it under data/inria/. You can also download the Tiny images dataset (80 million GIST descriptors) from http://horatio.cs.nyu.edu/mit/tiny/data/index.html and store it under data/tiny.

Running `create_lsh_codes.m` (a MATLAB script) generates binary codes from the above datasets using random projections (LSH, "Similarity estimation techniques from rounding algorithms, M. Charikar, STOC. 2002"). The first few lines of `create_lsh_codes.m` set the parameters of the script. The output is in MATLAB (version 7.3) format, which is essentially hdf5 format, so the hdf5 library is used to read the binary code datasets.

### Usage

`RUN.sh` is a bash script showing an example run of the program on 64-bit codes. Set the parameters `nb`, `HUGE`, `hashfunc`, etc. to change the setting. `RUN.sh` includes suggested values for `m`, the number of hash tables.
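The random-projection hashing mentioned under Datasets can be sketched in a few lines of Python. This is a minimal illustration of sign-of-random-projection LSH, not a port of `create_lsh_codes.m`: the helper names `make_projections` and `lsh_code` are hypothetical, the Gaussian projections and the bit order are assumptions, and any preprocessing the MATLAB script may apply (such as centering the data) is left out.

```python
import random


def make_projections(dim, nbits, seed=0):
    """Draw nbits random Gaussian directions in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(nbits)]


def lsh_code(vec, projections):
    """Pack the signs of the projections of vec into one integer.

    Bit i is 1 when the dot product with projection i is non-negative.
    """
    code = 0
    for i, w in enumerate(projections):
        if sum(wi * vi for wi, vi in zip(w, vec)) >= 0:
            code |= 1 << i
    return code


projections = make_projections(dim=8, nbits=64)
v = [0.3, -1.2, 0.7, 0.0, 2.1, -0.4, 0.9, -0.6]

code_v = lsh_code(v, projections)
code_scaled = lsh_code([2.0 * x for x in v], projections)  # same direction as v
code_neg = lsh_code([-x for x in v], projections)          # opposite direction
```

Only the direction of a vector affects its code: scaling `v` leaves every sign unchanged, while negating `v` flips every bit. Vectors separated by a small angle therefore tend to receive codes with a small Hamming distance.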
##### Linear scan

`linscan` provides an efficient implementation of exhaustive linear scan for kNN in Hamming distance on binary codes. This serves as a good baseline.

```
linscan <infile> <outfile> [options]

Options:
 -N <number>  Set the number of binary codes from the beginning of the dataset file to be used
 -B <number>  Set the number of bits per code, default autodetect
 -Q <number>  Set the number of query points to use from <infile>, default all
 -K <number>  Set number of nearest neighbors to be retrieved
```

Examples:

```
./build/linscan data/lsh_64_sift_1M.mat linscan_64_1M.h5 -N 100000 -B 64 -Q 1000 -K 100
./build/linscan data/lsh_64_sift_1M.mat linscan_64_1M.h5 -N 1000000 -B 64 -Q 1000 -K 100
```

Assuming that a dataset of 64-bit binary codes is stored at `data/lsh_64_sift_1M.mat`, running the above lines will create an output file `linscan_64_1M.h5`, which stores the results and timings for 100-NN search on 100K and 1M binary codes. On the first run, when the output file does not exist, it is created. On later runs, the new results are appended to the existing file.

##### Multi Index Hashing

`mih` provides an implementation of multi-index hashing for fast exact kNN in Hamming distance on binary codes.
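To make the relationship between the two programs concrete, the following Python sketch places an exhaustive baseline next to a simplified multi-index kNN search and checks that they agree. It is an illustration under stated assumptions and not the code in `src/mihasher.cpp`: all function names are hypothetical, m is assumed to divide the code length, plain dictionaries stand in for the hash tables, and the radius schedule is simpler than the one in the paper.

```python
import heapq
import random
from itertools import combinations


def hamming(a, b):
    return bin(a ^ b).count("1")


def linscan_knn(codes, query, k):
    """Exhaustive baseline: the k smallest (distance, index) pairs."""
    return heapq.nsmallest(k, ((hamming(c, query), i) for i, c in enumerate(codes)))


def substrings(code, b, m):
    s = b // m  # assumes m divides b
    mask = (1 << s) - 1
    return [(code >> (j * s)) & mask for j in range(m)]


def build_tables(codes, b, m):
    """One table per substring position: substring value -> list of code indices."""
    tables = [dict() for _ in range(m)]
    for i, c in enumerate(codes):
        for j, sub in enumerate(substrings(c, b, m)):
            tables[j].setdefault(sub, []).append(i)
    return tables


def neighbors_at(value, s, d):
    """All s-bit values at Hamming distance exactly d from value."""
    for positions in combinations(range(s), d):
        flipped = value
        for p in positions:
            flipped ^= 1 << p
        yield flipped


def mih_knn(codes, tables, query, k, b, m):
    s = b // m
    q_subs = substrings(query, b, m)
    seen, found = set(), []  # found holds verified (distance, index) pairs
    for d in range(s + 1):  # grow the substring search radius
        for j in range(m):
            for key in neighbors_at(q_subs[j], s, d):
                for i in tables[j].get(key, ()):
                    if i not in seen:
                        seen.add(i)
                        found.append((hamming(codes[i], query), i))
        # Pigeonhole: after probing radius d in all m tables, every code
        # within full distance m * (d + 1) - 1 of the query has been seen.
        r = m * (d + 1) - 1
        within = sorted(t for t in found if t[0] <= r)
        if len(within) >= k:
            return within[:k]
    return sorted(found)[:k]


rng = random.Random(0)
b, m, k = 16, 4, 5
codes = [rng.getrandbits(b) for _ in range(500)]
tables = build_tables(codes, b, m)
query = rng.getrandbits(b)
mih_result = mih_knn(codes, tables, query, k, b, m)
scan_result = linscan_knn(codes, query, k)
```

Both functions return the same k `(distance, index)` pairs, with ties broken by index. The multi-index version only computes full distances for codes that share a near-matching substring with the query.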
```
mih <infile> <outfile> [options]

Options:
 -N <number>  Set the number of binary codes from the beginning of the dataset file to be used
 -B <number>  Set the number of bits per code, default autodetect
 -Q <number>  Set the number of query points to use from <infile>, default all
 -m <number>  Set the number of chunks to use, default 1
 -K <number>  Set number of nearest neighbors to be retrieved
 -R <number>  Set the number of codes (in Millions) to use in computing the optimal bit reordering, default OFF (0)
```

Examples:

```
./build/mih data/lsh_64_sift_1M.mat mih_64_1M.h5 -N 100000 -B 64 -m 5 -Q 10000 -K 100
./build/mih data/lsh_64_sift_1M.mat mih_64_1M.h5 -N 1000000 -B 64 -m 4 -Q 10000 -K 100
```

The options of `mih` are very similar to those of `linscan`. It has an additional argument (`-m`) that sets the number of hash tables. It also has an option (`-R`) that controls whether the assignment of bits to the substrings is optimized.

### FAQs

Q: I have tried your code with some of my datasets. It works well when I used small datasets, but it does not perform well with large datasets.

A: Did you try decreasing the number of hash tables (by the `-m` switch) as you increased the size of the database? My experience is that with the correct choice of m, the speedup on larger datasets should be much better. In the `RUN.sh` file, I have a set of suggestions for the values of m for different numbers of codes in the datasets.

### License

Copyright (c) 2012, Mohammad Norouzi [<mohammad.n@gmail.com>] and Ali Punjani [<alipunjani@cs.toronto.edu>]. This is free software; for license information please refer to the license.txt file.

### TODO

- Automatic suggestion for the m parameter.
- Multi-core functionality.
- Improve SparseHashtable insertion speed. It is currently very slow, but can be improved in different ways.
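For the choice of m discussed in the FAQ, a rough starting point is the rule of thumb from the paper that each substring should be about log2(n) bits long for a database of n codes. The Python sketch below turns that into a small helper. The name `suggest_m` and the rounding are assumptions, and the values suggested in `RUN.sh` should take precedence; the examples above use `-m 5` for 100K codes and `-m 4` for 1M codes, slightly above what this rule gives.

```python
import math


def suggest_m(nbits, n_codes):
    """Rough number of substrings: nbits / log2(n_codes), at least 1."""
    if n_codes < 2:
        return 1
    return max(1, round(nbits / math.log2(n_codes)))


m_100k = suggest_m(64, 100_000)      # substring length near log2(1e5), about 16.6 bits
m_1m = suggest_m(64, 1_000_000)      # substring length near log2(1e6), about 19.9 bits
m_1b = suggest_m(64, 1_000_000_000)  # substring length near log2(1e9), about 29.9 bits
```

The trend matches the FAQ advice: the suggested m shrinks as the database grows.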
