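A collection of frequently used NLP algorithm packages, useful for learning and applying deep-learning-based text processing (.zip resource, CSDN Library)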

242 files in total

py: 199 files

txt: 17 files

md: 15 files

146 views · uploaded 2024-02-15 15:46:54 · 4.88MB ZIP

Archive contents (242 files in the .zip):

.gitignore 1KB

.gitignore 223B

train.json 2.16MB

vocab.json 1.85MB

dev.json 278KB

test.json 172KB

train.json 48KB

dev.json 25KB

test.json 23KB

LICENSE 1KB

README.md 13KB

README.md 12KB

quickstart.md 8KB

embeddings.md 4KB

index.md 4KB

README.md 3KB

README.md 2KB

autodoc.md 2KB

changelog.md 2KB

faq.md 2KB

README.md 309B

maxsum.md 62B

mmr.md 52B

keybert.md 39B

modeling_xlnet.py 71KB

bert_modeling.py 60KB

modeling_bert.py 58KB

modeling_albert.py 58KB

tokenization_utils.py 54KB

modeling_albert_bright.py 52KB

modeling_xlm.py 44KB

modeling_utils.py 42KB

modeling_transfo_xl.py 39KB

tokenization_xlm.py 36KB

modeling_auto.py 36KB

modeling_distilbert.py 34KB

modeling_gpt2.py 32KB

modeling_openai.py 30KB

modeling_roberta.py 25KB

modeling_ctrl.py 23KB

tokenization_bert.py 22KB

tokenization_transfo_xl.py 21KB

lr_scheduler.py 21KB

crf.py 20KB

lr_finder.py 18KB

bert_tokenization.py 17KB

batcher.py 17KB

ner_train.py 14KB

modeling_transfo_xl_utilities.py 13KB

bert_optimization.py 13KB

data.py 13KB

get_file.py 13KB

get_file.py 12KB

_model.py 12KB

dutils.py 12KB

common.py 12KB

ner_seq.py 11KB

ner_span.py 11KB

train.py 11KB

file_utils.py 11KB

tokenization_albert.py 11KB

configuration_utils.py 11KB

tfidf.py 10KB

tokenization_xlnet.py 10KB

model.py 10KB

tokenization_gpt2.py 10KB

gpt_train.py 9KB

futils.py 9KB

train.py 9KB

semantic_search.py 8KB

configuration_xlm.py 8KB

configuration_auto.py 8KB

utils_ner.py 8KB

adafactor.py 8KB

tokenization_openai.py 8KB

decode.py 8KB

tokenization_auto.py 7KB

ner_predict.py 7KB

run.py 7KB

__main__.py 7KB

tokenization_ctrl.py 7KB

gpt_predict.py 7KB

configuration_transfo_xl.py 7KB

configuration_xlnet.py 7KB

language_model_generate.py 7KB

configuration_bert.py 7KB

eda.py 6KB

beam_search_decode.py 6KB

tokenization_roberta.py 6KB

distance.py 6KB

albert_for_ner.py 6KB

bert_for_ner.py 6KB

word2vec.py 6KB

config.py 6KB

configuration_gpt2.py 6KB

ngram.py 6KB

adabound.py 6KB

configuration_ctrl.py 6KB

# text2vec

1. Includes the sentence-bert and Tencent Word2Vec word-vector methods.
2. Vector representations of words, sentences, and long texts, plus text similarity computation.

**Guide**

- [Feature](#Feature)
- [Install](#install)
- [Usage](#usage)
- [Reference](#reference)

# Feature

#### Text vector representation

- Word level: word2vec vectors for characters and words are obtained from the large-scale, high-quality Chinese [word vectors (lightweight version, 8 million Chinese words)](https://pan.baidu.com/s/1La4U4XNFe8s5BJqxPQpeiQ) open-sourced by Tencent AI Lab (file name: light_Tencent_AILab_ChineseEmbedding.bin, password: tawe).
- Sentence level: computed as the average of the word vectors of all words in the sentence.
- Document level: can be obtained with doc2vec from the gensim library; it is rarely used, so this project does not implement it.

#### Text similarity computation

- Baseline method: the simplest way to estimate the semantic similarity of two sentences is to average the word vectors of all words in each sentence and then compute the cosine similarity between the two resulting sentence vectors.
- Word Mover's Distance (WMD): uses the word vectors of two texts to measure the minimum distance that the words of one text must travel in semantic space to reach the words of the other text.

#### Comparing a query against docs

- rank_bm25 method: uses a variant of the BM25 algorithm to score the similarity between the query and each document, producing a ranking of the docs.
- semantic_search method: uses cosine similarity + topk for efficient computation, an order of magnitude faster than brute-force one-to-one comparison.

## Survey conclusions

#### Text similarity computation

- Baseline method (Word2Vec + Cosine)

Although the baseline method for text similarity is very simple, cosine similarity between averaged word vectors performs very well. The experiments lead to the following conclusions:

1. Plain word2vec vectors perform better than GloVe vectors.
2. With word2vec, it is not yet clear whether a stop-word list or TF-IDF weighting helps more. On the STS dataset they help a little; on SICK they do not help. Simply averaging all word2vec vectors without weighting performs well.
3. With GloVe, a stop-word list is essential for good results. TF-IDF weighting does not help.

![The baseline method performs very well](./docs/base1.jpg)

- Word Mover's Distance (WMD)

Based on our results, there seems to be little reason to use Word Mover's Distance, because the method above already performs well. Only on the STS-TEST dataset, and only with a stop-word list, can Word Mover's Distance compete with the simple baseline.

![Word Mover's Distance performs disappointingly](./docs/move1.jpg)

- Pretrained language model (SentenceBERT)

After fine-tuning and multilingual transfer, the SentenceBERT family of models can produce embeddings for multilingual sentences of up to 128 characters.

`paraphrase-multilingual-MiniLM-L12-v2` is the multilingual version of the `paraphrase-MiniLM-L6-v2` model. It is fast, performs well, and supports Chinese. By default, text2vec loads this model through the transformers library as `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.

The following SentenceBERT models can also be loaded through the sentence-transformers library; see [https://github.com/UKPLab/sentence-transformers](https://github.com/UKPLab/sentence-transformers) for details.

| Model Name | STSb | DupQ | TwitterP | SciDocs | Clustering | Avg. Performance | Speed |
| :------- | :--------- | :--------- | :---------: | :---------: | :---------: | :---------: | :---------: |
| paraphrase-mpnet-base-v2 | 86.99 | 87.80 | 76.05 | 80.57 | 52.81 | 76.84 | 2800 |
| paraphrase-multilingual-mpnet-base-v2 | 86.82 | 87.50 | 76.52 | 78.66 | 47.46 | 75.39 | 2500 |
| paraphrase-TinyBERT-L6-v2 | 84.91 | 86.93 | 75.39 | 81.51 | 48.04 | 75.36 | 4500 |
| paraphrase-distilroberta-base-v2 | 85.37 | 86.97 | 73.96 | 80.25 | 49.18 | 75.15 | 4000 |
| paraphrase-MiniLM-L12-v2 | 84.41 | 87.28 | 75.34 | 80.08 | 46.95 | 74.81 | 7500 |
| paraphrase-MiniLM-L6-v2 | 84.12 | 87.23 | 76.32 | 78.91 | 45.34 | 74.38 | 14200 |
| paraphrase-multilingual-MiniLM-L12-v2 | 84.42 | 87.52 | 74.94 | 78.27 | 43.87 | 73.80 | 7500 |
| paraphrase-MiniLM-L3-v2 | 82.41 | 88.09 | 76.14 | 77.71 | 43.39 | 73.55 | 19000 |
| distiluse-base-multilingual-cased-v2 | 80.75 | 83.52 | 76.26 | 70.39 | 37.03 | 69.59 | 4000 |
| average_word_embeddings_glove.6B.300d | 61.77 | 78.07 | 68.60 | 63.69 | 30.46 | 60.52 | 34000 |

# Install

```
pip3 install -U text2vec
```

or

```
git clone https://github.com/shibing624/text2vec.git
cd text2vec
python3 setup.py install
```

# Usage

1. Compute text vectors

- Compute text vectors with a `pretrained model`

> `SBert` computes sentence vectors with the pretrained `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` model

> `Word2Vec` computes the word vector of each word from the Tencent word vectors `Tencent_AILab_ChineseEmbedding.tar.gz`; the sentence vector is the average of the word vectors

Example: [computing_embeddings.py](../examples/computing_embeddings.py)

```python
import sys

sys.path.append('..')
from Text2Vec import SBert


def compute_emb(model):
    # Embed a list of sentences
    sentences = ['卡',
                 '银行卡',
                 '如何更换花呗绑定银行卡',
                 '花呗更改绑定银行卡',
                 'This framework generates embeddings for each input sentence',
                 'Sentences are passed as a list of string.',
                 'The quick brown fox jumps over the lazy dog.']
    sentence_embeddings = model.encode(sentences)

    print(type(sentence_embeddings), sentence_embeddings.shape)

    # The result is a list of sentence embeddings as numpy arrays
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding:", embedding)
        print("")


sbert_model = SBert('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
compute_emb(sbert_model)
```

output:

```
<class 'numpy.ndarray'> (7, 384)
Sentence: 卡
Embedding: [ 1.39491949e-02 8.62287879e-02 -1.35622978e-01 ... ]

Sentence: 银行卡
Embedding: [ 0.06216322 0.2731747 -0.6912158 ... ]
```

The return value `embeddings` is of type `numpy.ndarray` with shape `(sentence_size, model_embedding_size)`.

> `paraphrase-multilingual-MiniLM-L12-v2` is a pretrained `sentence-bert` model: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 50+ languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish. The model is downloaded automatically to the local path `~/.cache/torch/sentence_transformers/`

> `w2v-light-tencent-chinese` is the lightweight Tencent word-vector model for `Word2Vec`. The model is downloaded automatically to the local path `~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin`

- Pretrained word-vector models

Two sets of `Word2Vec` word vectors are provided; choose either one:

  - Lightweight Tencent word vectors: [Baidu Cloud Drive (password: tawe)](https://pan.baidu.com/s/1La4U4XNFe8s5BJqxPQpeiQ) or [Google Drive](https://drive.google.com/u/0/uc?id=1iQo9tBb2NgFOBxx0fA16AZpSgc-bG_Rp&export=download). Binary format; when the program runs, the file is downloaded automatically to `~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin`
  - [Tencent word vectors, official full version](https://ai.tencent.com/ailab/nlp/zh/data/Tencent_AILab_ChineseEmbedding.tar.gz), 6.78G; place it at `~/.text2vec/datasets/Tencent_AILab_ChineseEmbedding.txt`

Tencent word vectors home page: https://ai.tencent.com/ailab/nlp/zh/embedding.html

Word vectors download URL: https://ai.tencent.com/ailab/nlp/zh/data/Tencent_AILab_ChineseEmbedding.tar.gz

For more information, see the [Tencent word vectors introduction (wiki)](https://github.com/shibing624/text2vec/wiki/%E8%85%BE%E8%AE%AF%E8%AF%8D%E5%90%91%E9%87%8F%E4%BB%8B%E7%BB%8D)

2. Compute the similarity between sentences

Example: [semantic_text_similarity.py](../examples/semantic_text_similarity.py)

```python
import sys

sys.path.append('..')
from Text2Vec import Similarity

# Two lists of sentences
sentences1 = ['如何更换花呗绑定银行卡',
              'The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['花呗更改绑定银行卡',
              'The dog plays in the garden',
              'A woman watches TV',
```
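
The baseline method and the semantic_search idea described under Feature (average the word vectors of a sentence, then rank candidates by cosine similarity and keep the top k) can be illustrated without downloading any model. The following is a minimal sketch under stated assumptions: the 3-dimensional vectors in `TOY_VECTORS` are made-up values, and the helper names `sentence_vector`, `cosine_similarity`, and `top_k` are hypothetical and not part of the text2vec API. It uses only NumPy.

```python
import numpy as np

# Toy 3-dimensional word vectors. Hypothetical values for illustration only;
# a real setup would load pretrained vectors such as the Tencent embeddings.
TOY_VECTORS = {
    "change": np.array([0.9, 0.1, 0.0]),
    "replace": np.array([0.8, 0.2, 0.1]),
    "bank": np.array([0.1, 0.9, 0.2]),
    "card": np.array([0.2, 0.8, 0.3]),
    "cat": np.array([0.0, 0.1, 0.9]),
    "sits": np.array([0.1, 0.0, 0.8]),
}


def sentence_vector(tokens, vectors=TOY_VECTORS):
    """Average the vectors of all known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Score every doc with one matrix product and return the k best (index, score)."""
    docs = np.asarray(doc_vecs, dtype=float)
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query_vec)
    # Replace zero norms by 1 so empty sentences score 0 instead of dividing by zero.
    scores = docs @ query_vec / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


query = sentence_vector(["change", "bank", "card"])
docs = [
    sentence_vector(["replace", "bank", "card"]),
    sentence_vector(["cat", "sits"]),
]
ranking = top_k(query, docs, k=2)
print(ranking)
```

Scoring all documents with a single matrix product, rather than looping over query/document pairs, is the kind of batching that makes a cosine + topk search faster than one-to-one comparison; the actual speedup depends on the library and hardware.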
