# SentencePiece Python Wrapper
Python wrapper for SentencePiece. This API provides encoding, decoding, and training of SentencePiece models.
## Build and Install SentencePiece
For Linux (x64/i686), macOS, and Windows (win32/x64) environments, you can simply use the pip command to install the SentencePiece Python module.
```
% pip install sentencepiece
```
To build and install the Python wrapper from source, run the following commands to build and install the wheel package.
```
% git clone https://github.com/google/sentencepiece.git
% cd sentencepiece
% mkdir build
% cd build
% cmake .. -DSPM_ENABLE_SHARED=OFF -DCMAKE_INSTALL_PREFIX=./root
% make install
% cd ../python
% python setup.py bdist_wheel
% pip install dist/sentencepiece*.whl
```
If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```
## Usage
See [this Google Colab notebook](https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb) to run SentencePiece interactively.
### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor(model_file='test/test_model.model')
>>> sp.encode('This is a test')
[284, 47, 11, 4, 15, 400]
>>> sp.encode(['This is a test', 'Hello world'], out_type=int)
[[284, 47, 11, 4, 15, 400], [151, 88, 21, 887]]
>>> sp.encode_as_ids(['This is a test', 'Hello world'])
[[284, 47, 11, 4, 15, 400], [151, 88, 21, 887]]
>>> sp.encode('This is a test', out_type=str)
['▁This', '▁is', '▁a', '▁', 't', 'est']
>>> sp.encode(['This is a test', 'Hello world'], out_type=str)
[['▁This', '▁is', '▁a', '▁', 't', 'est'], ['▁He', 'll', 'o', '▁world']]
>>> sp.encode_as_pieces(['This is a test', 'Hello world'])
[['▁This', '▁is', '▁a', '▁', 't', 'est'], ['▁He', 'll', 'o', '▁world']]
>>> proto = sp.encode('This is a test', out_type='immutable_proto')
>>> for n in proto.pieces:
... print('piece="{}" surface="{}" id={} begin={} end={}'.format(n.piece, n.surface, n.id, n.begin, n.end))
...
piece="▁This" surface="This" id=284 begin=0 end=4
piece="▁is" surface=" is" id=47 begin=4 end=7
piece="▁a" surface=" a" id=11 begin=7 end=9
piece="▁" surface=" " id=4 begin=9 end=10
piece="t" surface="t" id=15 begin=10 end=11
piece="est" surface="est" id=400 begin=11 end=14
>>> [[x.id for x in proto.pieces], [x.piece for x in proto.pieces], [x.begin for x in proto.pieces], [x.end for x in proto.pieces]]
[[284, 47, 11, 4, 15, 400], ['▁This', '▁is', '▁a', '▁', 't', 'est'], [0, 4, 7, 9, 10, 11], [4, 7, 9, 10, 11, 14]]
>>> proto2 = sp.encode_as_immutable_proto('This is a test')
>>> proto2 == proto
True
>>> for _ in range(10):
... sp.encode('This is a test', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'This', '▁', 'is', '▁a', '▁', 't', 'e', 'st']
['▁T', 'h', 'i', 's', '▁is', '▁a', '▁', 'te', 's', 't']
['▁T', 'h', 'is', '▁', 'is', '▁', 'a', '▁', 't', 'est']
['▁', 'This', '▁is', '▁', 'a', '▁', 't', 'e', 'st']
['▁', 'This', '▁', 'is', '▁', 'a', '▁', 't', 'e', 's', 't']
['▁This', '▁is', '▁a', '▁', 'te', 's', 't']
['▁This', '▁is', '▁', 'a', '▁', 't', 'e', 'st']
['▁', 'T', 'h', 'is', '▁', 'is', '▁', 'a', '▁', 'te', 'st']
['▁', 'This', '▁', 'i', 's', '▁a', '▁', 't', 'e', 'st']
['▁This', '▁', 'is', '▁a', '▁', 't', 'est']
>>> sp.nbest_encode('This is a test', nbest_size=5, out_type=str)
[['▁This', '▁is', '▁a', '▁', 't', 'est'],
['▁This', '▁is', '▁a', '▁', 'te', 'st'],
['▁This', '▁is', '▁a', '▁', 'te', 's', 't'],
['▁This', '▁is', '▁a', '▁', 't', 'e', 'st'],
['▁This', '▁is', '▁a', '▁', 't', 'es', 't']]
>>> sp.sample_encode_and_score('This is a test', num_samples=5, alpha=0.1, out_type=str, wor=True)
[(['▁This', '▁', 'i', 's', '▁a', '▁', 'te', 's', 't'], -3.043105125427246),
(['▁This', '▁', 'i', 's', '▁a', '▁', 'te', 'st'], -2.8475849628448486),
(['▁', 'This', '▁is', '▁', 'a', '▁', 'te', 'st'], -3.043248176574707),
(['▁', 'This', '▁is', '▁a', '▁', 't', 'e', 'st'], -2.87727689743042),
(['▁', 'This', '▁', 'i', 's', '▁', 'a', '▁', 't', 'est'], -3.6284031867980957)]
>>> sp.decode([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.decode([[284, 47, 11, 4, 15, 400], [151, 88, 21, 887]])
['This is a test', 'Hello world']
>>> proto = sp.decode([284, 47, 11, 4, 15, 400], out_type='immutable_proto')
>>> proto.text
'This is a test'
>>> sp.decode(['▁', 'This', '▁', 'is', '▁a', '▁', 't', 'e', 'st'])
'This is a test'
>>> sp.decode([['▁This', '▁is', '▁a', '▁', 't', 'est'], ['▁He', 'll', 'o', '▁world']])
['This is a test', 'Hello world']
>>> sp.get_piece_size()
1000
>>> sp.id_to_piece(2)
'</s>'
>>> sp.id_to_piece([2, 3, 4])
['</s>', '\r', '▁']
>>> sp.piece_to_id('<s>')
1
>>> sp.piece_to_id(['</s>', '\r', '▁'])
[2, 3, 4]
>>> len(sp)
1000
>>> sp['</s>']
2
```
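As the sessions above show, pieces that start a new word are prefixed with the meta symbol `▁` (U+2581). Conceptually, turning pieces back into text is just concatenation plus restoring spaces from that marker. The following is a simplified pure-Python sketch of that idea only; the actual library's `decode` additionally handles normalization and special tokens:

```python
# Simplified illustration of the '▁' (U+2581) word-boundary convention.
# This is NOT the library's implementation; it only covers the plain
# pieces shown in the session above.
def pieces_to_text(pieces):
    """Join subword pieces and turn the '▁' markers back into spaces."""
    return ''.join(pieces).replace('\u2581', ' ').strip()

print(pieces_to_text(['▁This', '▁is', '▁a', '▁', 't', 'est']))  # → This is a test
```

Because the marker preserves word boundaries, any segmentation of the same text (including the sampled ones above) decodes back to the identical string.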
### Model Training
Training is performed by passing the parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to the SentencePieceTrainer.train() function.
```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.train(input='test/botchan.txt', model_prefix='m', vocab_size=1000, user_defined_symbols=['foo', 'bar'])
sentencepiece_trainer.cc(73) LOG(INFO) Starts training with :
trainer_spec {
input: test/botchan.txt
.. snip
unigram_model_trainer.cc(500) LOG(INFO) EM sub_iter=1 size=1188 obj=10.2839 num_tokens=32182 num_tokens/piece=27.0892
unigram_model_trainer.cc(500) LOG(INFO) EM sub_iter=0 size=1100 obj=10.4269 num_tokens=33001 num_tokens/piece=30.0009
unigram_model_trainer.cc(500) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4069 num_tokens=33002 num_tokens/piece=30.0018
trainer_interface.cc(595) LOG(INFO) Saving model: m.model
trainer_interface.cc(619) LOG(INFO) Saving vocabs: m.vocab
>>>
```
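Each keyword argument corresponds to an `spm_train` command-line flag of the same name, with list values joined by commas. As a rough illustration of that mapping (the helper below is hypothetical, not part of the library):

```python
# Hypothetical helper: renders Python keyword arguments in the
# --key=value form used by the spm_train CLI. Lists become
# comma-separated values, matching the CLI convention.
def to_spm_train_args(**kwargs):
    parts = []
    for key, value in kwargs.items():
        if isinstance(value, (list, tuple)):
            value = ','.join(map(str, value))
        parts.append('--{}={}'.format(key, value))
    return ' '.join(parts)

print(to_spm_train_args(input='test/botchan.txt', model_prefix='m',
                        vocab_size=1000, user_defined_symbols=['foo', 'bar']))
# → --input=test/botchan.txt --model_prefix=m --vocab_size=1000 --user_defined_symbols=foo,bar
```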
### Training without local filesystem
The SentencePiece trainer can receive any iterable object to feed training sentences. You can also pass a file object (any instance with a write() method) to emit the output model to any device. These features are useful for running SentencePiece in environments with limited access to the local file system (e.g., Google Colab).
```
import urllib.request
import io
import sentencepiece as spm
# Loads training data from the URL as an iterator and stores the trained model in a BytesIO object.
model = io.BytesIO()
with urllib.request.urlopen(
'https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt'
) as response:
spm.SentencePieceTrainer.train(
sentence_iterator=response, model_writer=model, vocab_size=1000)
# Serialize the model to a file.
# with open('out.model', 'wb') as f:
# f.write(model.getvalue())
# Directly load the model from the serialized bytes.
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
print(sp.encode('this is test'))
```
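The `model_writer` argument is duck-typed: the trainer only calls write() with the serialized model bytes, so `io.BytesIO` works, and so does any minimal object of your own. A sketch of that interface, assuming only what the example above shows:

```python
# Illustrative duck-typed writer (not part of the sentencepiece library):
# anything with a write() method accepting bytes can serve as model_writer.
class CollectingWriter:
    def __init__(self):
        self.chunks = []

    def write(self, data):
        """Accumulate the bytes the trainer emits."""
        self.chunks.append(data)
        return len(data)

    def getvalue(self):
        """Return everything written so far as a single bytes object."""
        return b''.join(self.chunks)

w = CollectingWriter()
w.write(b'serialized-model-bytes')
print(w.getvalue())  # → b'serialized-model-bytes'
```

You could then pass `w.getvalue()` to `SentencePieceProcessor(model_proto=...)` exactly as the BytesIO example above does.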