[![Build Status](https://travis-ci.org/Tessil/hat-trie.svg?branch=master)](https://travis-ci.org/Tessil/hat-trie) [![Build status](https://ci.appveyor.com/api/projects/status/ieafyj08ewb7dfa7/branch/master?svg=true)](https://ci.appveyor.com/project/Tessil/hat-trie/branch/master)
## A C++ implementation of a fast and memory efficient HAT-trie
https://github.com/Tessil/hat-trie
Trie implementation based on the "HAT-trie: A Cache-conscious Trie-based Data Structure for Strings." (Askitis Nikolas and Sinha Ranjan, 2007) paper. For now, only the pure HAT-trie has been implemented, the hybrid version may arrive later. Details regarding the HAT-trie data structure can be found [here](https://tessil.github.io/2017/06/22/hat-trie.html).
The library provides an efficient and compact way to store a set or a map of strings by compressing the common prefixes. It also allows to search for keys that match a prefix. Note though that the default parameters of the structure are geared toward optimizing exact searches, if you do a lot of prefix searches you may want to reduce the burst threshold through the `burst_threshold` method.
It's a well adapted structure to store a large number of strings.
<p align="center">
<img src="https://tessil.github.io/images/hat-trie.png" width="600px" />
</p>
For the array hash part, the [array-hash](https://github.com/Tessil/array-hash) project is used and included in the repository.
The library provides two classes: `tsl::htrie_map` and `tsl::htrie_set`.
### Overview
- Header-only library, just add the [include](include/) directory to your include path and you are ready to go. If you use CMake, you can also use the `tsl::hat_trie` exported target from the [CMakeLists.txt](CMakeLists.txt).
- Low memory usage while keeping reasonable performances (see [benchmark](#benchmark)).
- Support prefix searches through `equal_prefix_range` (usefull for autocompletion for example) and prefix erasures through `erase_prefix`.
- Support longest matching prefix searches through `longest_prefix`.
- Support for efficient serialization and deserialization (see [example](#serialization) and the `serialize/deserialize` methods in the [API](https://tessil.github.io/hat-trie/doc/html/classtsl_1_1htrie__map.html) for details).
- Keys are not ordered as they are partially stored in a hash map.
- All operations modifying the data structure (insert, emplace, erase, ...) invalidate the iterators.
- Support null characters in the key (you can thus store binary data in the trie).
- Support for any type of value as long at it's either copy-constructible or both nothrow move constructible and nothrow move assignable.
- The balance between speed and memory usage can be modified through the `max_load_factor` method. A lower max load factor will increase the speed, a higher one will reduce the memory usage. Its default value is set to 8.0.
- The default burst threshold, which is the maximum size of an array hash node before a burst occurs, is set to 16 384 which provides good performances for exact searches. If you mainly use prefix searches, you may want to reduce it to something like 1024 or lower for faster iteration on the results through the `burst_threshold` method.
- By default the maximum allowed size for a key is set to 65 535. This can be raised through the `KeySizeT` template parameter.
Thread-safety and exception guarantees are similar to the STL containers.
### Hash function
The default hash function used by the structure depends on the presence of `std::string_view`. If it is available, `std::hash<std::string_view>` is used, otherwise a simple [FNV-1a](https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function#FNV-1a_hash) hash function is used to avoid any dependency.
If you can't use C++17 or later, we recommend to replace the hash function with something like [CityHash](https://github.com/google/cityhash), MurmurHash, [FarmHash](https://github.com/google/farmhash), ... for better performances. On the tests we did, CityHash64 offers a ~20% improvement on reads compared to FNV-1a.
```c++
#include <city.h>
struct str_hash {
std::size_t operator()(const char* key, std::size_t key_size) const {
return CityHash64(key, key_size);
}
};
tsl::htrie_map<char, int, str_hash> map;
```
The `std::hash<std::string>` can't be used efficiently as the structure doesn't store any `std::string` object. Any time a hash would be needed, a temporary `std::string` would have to be created.
### Benchmark
#### Wikipedia dataset
The benchmark consists in inserting all the titles from the main namespace of the Wikipedia archive into the data structure, check the used memory space after the insert (including potential memory fragmentation) and search for all the titles again in the data structure. The peak memory usage during the insert process is also measured with [time(1)](https://linux.die.net/man/1/time).
* Dataset: [enwiki-20170320-all-titles-in-ns0.gz](https://dumps.wikimedia.org/enwiki/20170320/)
* Size: 262.7 MiB
* Number of keys: 13 099 148
* Average key length: 19.90
* Median key length: 17
* Max key length: 251
Each title is associated with an int (32 bits). All the hash based structures use [CityHash64](https://github.com/google/cityhash) as hash function. For the tests marked *with reserve*, the `reserve` function is called beforehand to avoid any rehash.
Note that `tsl::hopscotch_map`, `std::unordered_map`, `google::dense_hash_map` and `spp::sparse_hash_map` use `std::string` as key which imposes a minimum size of 32 bytes (on x64) even if the key is only one character long. Other structures may be able to store one-character keys with 1 byte + 8 bytes for a pointer (on x64).
The benchmark was compiled with GCC 6.3 and ran on Debian Stretch x64 with an Intel i5-5200u and 8Go of RAM. Best of 20 runs was taken.
The code of the benchmark can be found on [Gist](https://gist.github.com/Tessil/72e11891fc155f5b2eb53de22cbc4053).
##### Unsorted
The *enwiki-20170320-all-titles-in-ns0.gz* dataset is alphabetically sorted. For this benchmark, we first shuffle the dataset through [shuf(1)](https://linux.die.net/man/1/shuf) to avoid a biased sorted dataset.
| Library | Data structure | Peak memory (MiB) | Memory (MiB) | Insert (ns/key) | Read (ns/key) |
|---------|----------------|------------------:|-------------:|----------------:|--------------:|
| [tsl::htrie_map](https://github.com/Tessil/hat-trie) | HAT-trie | **405.22** | **402.25** | 643.10 | 250.87 |
| [tsl::htrie_map](https://github.com/Tessil/hat-trie) <br/> max_load_factor=4 | HAT-trie | 471.85 | 468.50 | 638.66 | 212.90 |
| [tsl::htrie_map](https://github.com/Tessil/hat-trie) <br/> max_load_factor=2 | HAT-trie | 569.76 | 566.52 | 630.61 | 201.10 |
| [tsl::htrie_map](https://github.com/Tessil/hat-trie) <br/> max_load_factor=1 | HAT-trie | 713.44 | 709.81 | 645.76 | 190.87 |
| [cedar::da](http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/) | Double-array trie | 1269.68 | 1254.41 | 1102.93 | 557.20 |
| [cedar::da](http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/) ORDERED=false | Double-array trie | 1269.80 | 1254.41 | 1089.78 | 570.13 |
| [cedar::da](http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/) | Double-array reduced trie | 1183.07 | 1167.79 | 1076.68 | 645.79 |
| [cedar::da](http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/) ORDERED=false | Double-array reduced trie | 1183.14 | 1167.85 | 1065.43 | 641.98 |
| [cedar::da](http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/) | Double-array prefix trie | 498.69 | 496.54 | 1096.90 | 628.01 |
| [cedar::da](http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/) ORDERED=false | Double-array prefix trie | 498.65 | 496.60 | 1048.40 | 628.94 |
| [hat-trie](https://github.com/dcjones/hat-trie)<sup>1</sup> (C) | HAT-trie | 504.07 | 501.50 | 917.49 | 261.00 |
| [qp trie](https://github.com/fanf2/qp) (C) | QP trie | 941.23 | 938.17 | 1349.25 | 1281.46 |
| [crit-bit trie](https://github.com/fanf2/qp) (C) | Crit-bit trie | 1074.96 |
没有合适的资源?快使用搜索试试~ 我知道了~
Chatopera 语义理解系统:机器学习,聊天机器人,意图识别clause-osc.zip
共2000个文件
html:407个
png:228个
md5:202个
需积分: 5 0 下载量 108 浏览量
2024-08-19
10:13:18
上传
评论
收藏 111.69MB ZIP 举报
温馨提示
Chatopera 语义理解系统:机器学习,聊天机器人,意图识别clause-osc.zip
资源推荐
资源详情
资源评论
收起资源包目录
Chatopera 语义理解系统:机器学习,聊天机器人,意图识别clause-osc.zip (2000个子文件)
libjsoncpp.so.0 20B
flagfile.1 9B
flagfile.2 25B
flagfile.3 21B
libjsoncpp.so.0.10.6 273KB
libjsoncpp.a 428KB
configure.ac 6KB
configure.ac 3KB
configure.ac 461B
Makefile.am 11KB
Makefile.am 7KB
Makefile.am 315B
.astylerc 207B
AUTHORS 318B
BUILD.bazel 13KB
BUILD.bazel 5KB
BUILD.bazel 3KB
BUILD 513B
BUILD 113B
glog.bzl 4KB
gflags.bzl 4KB
rumavl.c 34KB
lookup3.c 34KB
crf1d_encode.c 29KB
crf1d_model.c 26KB
crf1d_context.c 21KB
cqdb.c 17KB
crf1d_tag.c 16KB
learn.c 16KB
train_l2sgd.c 15KB
crfsuite.c 14KB
tag.c 13KB
train_passive_aggressive.c 12KB
train_arow.c 11KB
train_lbfgs.c 10KB
crf1d_feature.c 10KB
params.c 10KB
crfsuite_train.c 8KB
train_averaged_perceptron.c 7KB
iwa.c 7KB
iwa.c 6KB
quark.c 5KB
main.c 4KB
reader.c 4KB
dictionary.c 4KB
dump.c 4KB
option.c 3KB
dataset.c 3KB
holdout.c 3KB
logging.c 3KB
gtest.cbproj 10KB
gtest_unittest.cbproj 9KB
gtest_main.cbproj 8KB
gtest_unittest.cc 244KB
gmock-matchers_test.cc 219KB
gtest.cc 215KB
gtest_pred_impl_unittest.cc 76KB
gflags.cc 74KB
gmock-spec-builders_test.cc 73KB
gtest-death-test.cc 57KB
googletest-printers-test.cc 56KB
gflags_unittest.cc 52KB
googletest-death-test-test.cc 44KB
gtest-port.cc 44KB
gmock-generated-matchers_test.cc 43KB
gmock-actions_test.cc 43KB
gmock-generated-actions_test.cc 40KB
googletest-port-test.cc 39KB
googletest-param-test-test.cc 39KB
googletest-output-test_.cc 34KB
gmock-spec-builders.cc 32KB
gflags_completions.cc 27KB
gmock-internal-utils_test.cc 25KB
gmock-more-actions_test.cc 24KB
googletest-filepath-test.cc 22KB
gmock-matchers.cc 21KB
gmock-generated-function-mockers_test.cc 20KB
gflags_reporting.cc 17KB
gmock-nice-strict_test.cc 15KB
gtest-printers.cc 15KB
gtest-typed-test_test.cc 15KB
gtest-filepath.cc 14KB
gtest-unittest-api_test.cc 13KB
gmock-cardinalities_test.cc 12KB
googletest-listener-test.cc 10KB
gtest_stress_test.cc 9KB
gmock_stress_test.cc 9KB
googletest-tuple-test.cc 9KB
sample6_unittest.cc 9KB
gmock_output_test_.cc 8KB
googletest-catch-exceptions-test_.cc 8KB
googletest-test-part-test.cc 8KB
gmock.cc 8KB
googletest-options-test.cc 7KB
gmock-internal-utils.cc 7KB
gtest_repeat_test.cc 7KB
sample8_unittest.cc 7KB
sample5_unittest.cc 6KB
gtest_environment_test.cc 6KB
gmock_test.cc 6KB
共 2000 条
- 1
- 2
- 3
- 4
- 5
- 6
- 20
资源评论
蜡笔小流
- 粉丝: 2336
- 资源: 1186
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功