# kenlm
Language model inference code by Kenneth Heafield (kenlm at kheafield.com)
The website https://kheafield.com/code/kenlm/ has more documentation. If you're a decoder developer, please download the latest version from there instead of copying from another decoder.
## Compiling
Use cmake; see [BUILDING](BUILDING) for build dependencies and more detail.
```bash
mkdir -p build
cd build
cmake ..
make -j 4
```
## Compiling with your own build system
If you want to compile with your own build system (Makefile, etc.) or use KenLM as a library, there are a number of macros you can set on the g++ command line or in util/have.hh.
* `KENLM_MAX_ORDER` is the maximum order that can be loaded. This is done to make state an efficient POD rather than a vector.
* `HAVE_ICU` If your code links against ICU, define this to disable the internal StringPiece and replace it with ICU's copy of StringPiece, avoiding naming conflicts.
ARPA files can be read in compressed format with these options:
* `HAVE_ZLIB` Supports gzip. Link with -lz.
* `HAVE_BZLIB` Supports bzip2. Link with -lbz2.
* `HAVE_XZLIB` Supports xz. Link with -llzma.
Note that these macros impact only `read_compressed.cc` and `read_compressed_test.cc`. The cmake build system will auto-detect bzip2 and xz support.
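For a sense of how these macros are passed when building by hand, here is a sketch of a g++ invocation; the source file list is abbreviated and `my_decoder.cc` is a hypothetical stand-in for your own code (see compile.sh for the full list of files your build actually needs):
```bash
# Sketch only: the real set of .cc files is longer; compile.sh has the full list.
g++ -O3 -I. -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB \
    lm/model.cc lm/read_arpa.cc util/read_compressed.cc \
    my_decoder.cc -lz -o my_decoder
```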
## Estimation
lmplz estimates unpruned language models with modified Kneser-Ney smoothing. After compiling with cmake, run
```bash
bin/lmplz -o 5 <text >text.arpa
```
The algorithm is on-disk, using an amount of memory that you specify. See https://kheafield.com/code/kenlm/estimation/ for more.
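For example, assuming the `-S` (memory) and `-T` (temporary file prefix) options of lmplz (run `bin/lmplz --help` to confirm on your version), this caps sorting memory at 80% of RAM and puts temporary files under /tmp:
```bash
bin/lmplz -o 5 -S 80% -T /tmp <text >text.arpa
```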
MT Marathon 2012 team members Ivan Pouzyrevsky and Mohammed Mediani contributed to the computation design and early implementation. Jon Clark contributed to the design, clarified points about smoothing, and added logging.
## Filtering
filter takes an ARPA or count file and removes entries that will never be queried. The filter criterion can be corpus-level vocabulary, sentence-level vocabulary, or sentence-level phrases. Run
```bash
bin/filter
```
and see https://kheafield.com/code/kenlm/filter/ for more documentation.
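As an illustrative sketch only (the mode name and argument order here are assumptions; run `bin/filter` with no arguments to see the exact usage), filtering an ARPA file to the union of per-sentence vocabularies of a test set might look like:
```bash
# Hypothetical invocation: check bin/filter's usage output for the real syntax.
bin/filter union lm.arpa filtered.arpa <test_set.txt
```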
## Querying
Two data structures are supported: probing and trie. Probing is a probing hash table with keys that are 64-bit hashes of n-grams and floats as values. Trie is a fairly standard trie but with bit-level packing so it uses the minimum number of bits to store word indices and pointers. The trie node entries are sorted by word index. Probing is the fastest and uses the most memory. Trie uses the least memory and is a bit slower.
As is the custom in language modeling, all probabilities are log base 10.
With trie, resident memory is 58% of IRST's smallest version and 21% of SRI's compact version. Simultaneously, trie's CPU use is 81% of IRST's fastest version and 84% of SRI's fast version. KenLM's probing hash table implementation goes even faster at the expense of using more memory. See https://kheafield.com/code/kenlm/benchmark/.
Binary format via mmap is supported. Run `bin/build_binary` to make one, then pass the binary file name to the appropriate Model constructor.
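For example (assuming probing is the default data structure and that passing `trie` as the first argument selects the trie structure described above):
```bash
bin/build_binary text.arpa text.binary        # probing (default): fastest, most memory
bin/build_binary trie text.arpa text.binary   # trie: smallest memory footprint
echo "this is a sentence ." | bin/query text.binary
```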
## Platforms
`murmur_hash.cc` and `bit_packing.hh` perform unaligned reads and writes that make the code architecture-dependent.
It has been successfully tested on x86\_64, x86, and PPC64.
ARM support is reportedly working, at least on the iPhone.
Runs on Linux, OS X, Cygwin, and MinGW.
Hideo Okuma and Tomoyuki Yoshimura from NICT contributed ports to ARM and MinGW.
## Decoder developers
- I recommend copying the code and distributing it with your decoder. However, please send improvements upstream.
- It's possible to compile the query-only code without Boost, but useful things like estimating models require Boost.
- Select the macros you want, listed in the previous section.
- There are two build systems: compile.sh and cmake. They're pretty simple and are intended to be reimplemented in your build system.
- Use either the interface in `lm/model.hh` or `lm/virtual_interface.hh`. Interface documentation is in comments of `lm/virtual_interface.hh` and `lm/model.hh`.
- There are several possible data structures in `model.hh`. Use `RecognizeBinary` in `binary_format.hh` to determine which one a user has provided. You probably already implement feature functions as an abstract virtual base class with several children. I suggest you co-opt this existing virtual dispatch by templatizing the language model feature implementation on the KenLM model identified by `RecognizeBinary`. This is the strategy used in Moses and cdec.
- See `lm/config.hh` for run-time tuning options.
## Contributors
Contributions to KenLM are welcome. Please base your contributions on https://github.com/kpu/kenlm and send pull requests (or I might give you commit access). Downstream copies in Moses and cdec are maintained by overwriting them, so do not make changes there.
## Python module
Contributed by Victor Chahuneau.
### Installation
```bash
pip install https://github.com/kpu/kenlm/archive/master.zip
```
### Basic Usage
```python
import kenlm
model = kenlm.Model('lm/test.arpa')
print(model.score('this is a sentence .', bos=True, eos=True))
```
See [python/example.py](python/example.py) and [python/kenlm.pyx](python/kenlm.pyx) for more, including stateful APIs.
---
The name was Hieu Hoang's idea, not mine.