# kenlm
Language model inference code by Kenneth Heafield (kenlm at kheafield.com)
The website https://kheafield.com/code/kenlm/ has more documentation. If you're a decoder developer, please download the latest version from there instead of copying from another decoder.
## Compiling
Use cmake; see [BUILDING](BUILDING) for build dependencies and more detail.
```bash
mkdir -p build
cd build
cmake ..
make -j 4
```
## Compiling with your own build system
If you want to compile with your own build system (Makefile, etc.) or use KenLM as a library, there are a number of macros you can set on the g++ command line or in util/have.hh.
* `KENLM_MAX_ORDER` is the maximum order that can be loaded. This is done to make state an efficient POD rather than a vector.
* `HAVE_ICU` If your code links against ICU, define this to disable the internal StringPiece and replace it with ICU's copy of StringPiece, avoiding naming conflicts.
ARPA files can be read in compressed format with these options:
* `HAVE_ZLIB` Supports gzip. Link with -lz.
* `HAVE_BZLIB` Supports bzip2. Link with -lbz2.
* `HAVE_XZLIB` Supports xz. Link with -llzma.
Note that these macros impact only `read_compressed.cc` and `read_compressed_test.cc`. The cmake build system will auto-detect bzip2 and xz support.
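For a sense of how these macros are passed when building by hand, here is a sketch of a g++ invocation; the source file list is abbreviated and `my_decoder.cc` is a hypothetical stand-in for your own code (see compile.sh for the full list of files your build actually needs):
```bash
# Sketch only: the real set of .cc files is longer; compile.sh has the full list.
g++ -O3 -I. -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB \
    lm/model.cc lm/read_arpa.cc util/read_compressed.cc \
    my_decoder.cc -lz -o my_decoder
```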
## Estimation
lmplz estimates unpruned language models with modified Kneser-Ney smoothing. After compiling with cmake, run
```bash
bin/lmplz -o 5 <text >text.arpa
```
The algorithm is on-disk, using an amount of memory that you specify. See https://kheafield.com/code/kenlm/estimation/ for more.
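For example, assuming the `-S` (memory) and `-T` (temporary file prefix) options of lmplz (run `bin/lmplz --help` to confirm on your version), this caps sorting memory at 80% of RAM and puts temporary files under /tmp:
```bash
bin/lmplz -o 5 -S 80% -T /tmp <text >text.arpa
```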
MT Marathon 2012 team members Ivan Pouzyrevsky and Mohammed Mediani contributed to the computation design and early implementation. Jon Clark contributed to the design, clarified points about smoothing, and added logging.
## Filtering
filter takes an ARPA or count file and removes entries that will never be queried. The filter criterion can be corpus-level vocabulary, sentence-level vocabulary, or sentence-level phrases. Run
```bash
bin/filter
```
and see https://kheafield.com/code/kenlm/filter/ for more documentation.
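As an illustrative sketch only (the mode name and argument order here are assumptions; run `bin/filter` with no arguments to see the exact usage), filtering an ARPA file to the union of per-sentence vocabularies of a test set might look like:
```bash
# Hypothetical invocation: check bin/filter's usage output for the real syntax.
bin/filter union lm.arpa filtered.arpa <test_set.txt
```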
## Querying
Two data structures are supported: probing and trie. Probing is a probing hash table with keys that are 64-bit hashes of n-grams and floats as values. Trie is a fairly standard trie but with bit-level packing so it uses the minimum number of bits to store word indices and pointers. The trie node entries are sorted by word index. Probing is the fastest and uses the most memory. Trie uses the least memory and is a bit slower.
As is the custom in language modeling, all probabilities are log base 10.
With trie, resident memory is 58% of IRST's smallest version and 21% of SRI's compact version. Simultaneously, trie's CPU use is 81% of IRST's fastest version and 84% of SRI's fast version. KenLM's probing hash table implementation goes even faster at the expense of using more memory. See https://kheafield.com/code/kenlm/benchmark/.
Binary format via mmap is supported. Run `bin/build_binary` to make one, then pass the binary file name to the appropriate Model constructor.
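For example (assuming probing is the default data structure and that passing `trie` as the first argument selects the trie structure described above):
```bash
bin/build_binary text.arpa text.binary        # probing (default): fastest, most memory
bin/build_binary trie text.arpa text.binary   # trie: smallest memory footprint
echo "this is a sentence ." | bin/query text.binary
```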
## Platforms
`murmur_hash.cc` and `bit_packing.hh` perform unaligned reads and writes that make the code architecture-dependent.
It has been successfully tested on x86\_64, x86, and PPC64.
ARM support is reportedly working, at least on the iPhone.
Runs on Linux, OS X, Cygwin, and MinGW.
Hideo Okuma and Tomoyuki Yoshimura from NICT contributed ports to ARM and MinGW.
## Decoder developers
- I recommend copying the code and distributing it with your decoder. However, please send improvements upstream.
- It's possible to compile the query-only code without Boost, but useful things like estimating models require Boost.
- Select the macros you want, listed in the previous section.
- There are two build systems: compile.sh and cmake. They're pretty simple and are intended to be reimplemented in your build system.
- Use either the interface in `lm/model.hh` or `lm/virtual_interface.hh`. Interface documentation is in comments of `lm/virtual_interface.hh` and `lm/model.hh`.
- There are several possible data structures in `model.hh`. Use `RecognizeBinary` in `binary_format.hh` to determine which one a user has provided. You probably already implement feature functions as an abstract virtual base class with several children. I suggest you co-opt this existing virtual dispatch by templatizing the language model feature implementation on the KenLM model identified by `RecognizeBinary`. This is the strategy used in Moses and cdec.
- See `lm/config.hh` for run-time tuning options.
## Contributors
Contributions to KenLM are welcome. Please base your contributions on https://github.com/kpu/kenlm and send pull requests (or I might give you commit access). Downstream copies in Moses and cdec are maintained by overwriting them, so do not make changes there.
## Python module
Contributed by Victor Chahuneau.
### Installation
```bash
pip install https://github.com/kpu/kenlm/archive/master.zip
```
### Basic Usage
```python
import kenlm
model = kenlm.Model('lm/test.arpa')
print(model.score('this is a sentence .', bos=True, eos=True))
```
See [python/example.py](python/example.py) and [python/kenlm.pyx](python/kenlm.pyx) for more, including stateful APIs.
---
The name was Hieu Hoang's idea, not mine.