Pre-trained ELMo Representations for Many Languages
===================================================
We release our ELMo representations trained on many languages,
which helped us win the [CoNLL 2018 shared task on Universal Dependencies Parsing](http://universaldependencies.org/conll18/results.html)
as measured by LAS.
## Technique Details
We use the same hyperparameter settings as [Peters et al. (2018)](https://arxiv.org/abs/1802.05365) for the biLM
and the character CNN.
We train the model parameters
on 20 million words randomly
sampled from the raw text released by the shared task (Wikipedia dump + Common Crawl) for each language.
Our code is largely based on [AllenNLP](https://allennlp.org/), with the following changes:
* We support unicode characters;
* We use the *sampled softmax* technique
to make training with a large vocabulary feasible ([Jean et al., 2015](https://arxiv.org/abs/1412.2007)).
However, we use a window of words surrounding the target word
as negative samples, which showed better performance in our preliminary experiments.
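As a rough illustration of the window-based negative sampling, here is a simplified numpy sketch (not the actual training code; all names and dimensions are made up): the softmax is normalized only over the target word plus its context-window words instead of the full vocabulary.

```python
import numpy as np

def window_sampled_softmax_loss(hidden, output_emb, sent_ids, pos, window=2):
    """Negative log-likelihood of the word at `pos`, normalizing only over
    the target plus the words in a +/- `window` context (the negatives)."""
    target = sent_ids[pos]
    lo, hi = max(0, pos - window), min(len(sent_ids), pos + window + 1)
    negatives = [sent_ids[j] for j in range(lo, hi)
                 if j != pos and sent_ids[j] != target]
    cand = np.array([target] + negatives)   # target word first
    logits = output_emb[cand] @ hidden      # score only the candidates
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                # NLL of the target word

# Toy example: 10-word vocabulary, 4-dimensional states.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 4))
loss = window_sampled_softmax_loss(emb[3], emb, [1, 4, 3, 7, 2], pos=2)
```

Restricting the normalization to a handful of nearby words keeps the per-step cost independent of the vocabulary size.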
The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.
## Downloads
| | | | |
|---|---|---|---|
| [Arabic](http://vectors.nlpl.eu/repository/11/136.zip) | [Bulgarian](http://vectors.nlpl.eu/repository/11/137.zip) | [Catalan](http://vectors.nlpl.eu/repository/11/138.zip) | [Czech](http://vectors.nlpl.eu/repository/11/139.zip) |
| [Old Church Slavonic](http://vectors.nlpl.eu/repository/11/140.zip) | [Danish](http://vectors.nlpl.eu/repository/11/141.zip) | [German](http://vectors.nlpl.eu/repository/11/142.zip) | [Greek](http://vectors.nlpl.eu/repository/11/143.zip) |
| [English](http://vectors.nlpl.eu/repository/11/144.zip) | [Spanish](http://vectors.nlpl.eu/repository/11/145.zip) | [Estonian](http://vectors.nlpl.eu/repository/11/146.zip) | [Basque](http://vectors.nlpl.eu/repository/11/147.zip) |
| [Persian](http://vectors.nlpl.eu/repository/11/148.zip) | [Finnish](http://vectors.nlpl.eu/repository/11/149.zip) | [French](http://vectors.nlpl.eu/repository/11/150.zip) | [Irish](http://vectors.nlpl.eu/repository/11/151.zip) |
| [Galician](http://vectors.nlpl.eu/repository/11/152.zip) | [Ancient Greek](http://vectors.nlpl.eu/repository/11/153.zip) | [Hebrew](http://vectors.nlpl.eu/repository/11/154.zip) | [Hindi](http://vectors.nlpl.eu/repository/11/155.zip) |
| [Croatian](http://vectors.nlpl.eu/repository/11/156.zip) | [Hungarian](http://vectors.nlpl.eu/repository/11/157.zip) | [Indonesian](http://vectors.nlpl.eu/repository/11/158.zip) | [Italian](http://vectors.nlpl.eu/repository/11/159.zip) |
| [Japanese](http://vectors.nlpl.eu/repository/11/160.zip) | [Korean](http://vectors.nlpl.eu/repository/11/161.zip) | [Latin](http://vectors.nlpl.eu/repository/11/162.zip) | [Latvian](http://vectors.nlpl.eu/repository/11/163.zip) |
| [Norwegian Bokmål](http://vectors.nlpl.eu/repository/11/165.zip) | [Dutch](http://vectors.nlpl.eu/repository/11/164.zip) | [Norwegian Nynorsk](http://vectors.nlpl.eu/repository/11/166.zip) | [Polish](http://vectors.nlpl.eu/repository/11/167.zip) |
| [Portuguese](http://vectors.nlpl.eu/repository/11/168.zip) | [Romanian](http://vectors.nlpl.eu/repository/11/169.zip) | [Russian](http://vectors.nlpl.eu/repository/11/170.zip) | [Slovak](http://vectors.nlpl.eu/repository/11/171.zip) |
| [Slovene](http://vectors.nlpl.eu/repository/11/172.zip) | [Swedish](http://vectors.nlpl.eu/repository/11/173.zip) | [Turkish](http://vectors.nlpl.eu/repository/11/174.zip) | [Uyghur](http://vectors.nlpl.eu/repository/11/175.zip) |
| [Ukrainian](http://vectors.nlpl.eu/repository/11/176.zip) | [Urdu](http://vectors.nlpl.eu/repository/11/177.zip) | [Vietnamese](http://vectors.nlpl.eu/repository/11/178.zip) | [Chinese](http://vectors.nlpl.eu/repository/11/179.zip) |
The models are hosted on the [NLPL Vectors Repository](http://wiki.nlpl.eu/index.php/Vectors/home).
**ELMo for Simplified Chinese**
We also provide a [simplified-Chinese ELMo](https://pan.baidu.com/s/1RNKnj6hgL-2orQ7f38CauA?errno=0&errmsg=Auth%20Login%20Sucess&&bduss=&ssnerror=0&traceid=) (see [issue 37](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/37)).
It was trained on the Xinhua portion of [Chinese Gigaword Fifth Edition](https://catalog.ldc.upenn.edu/ldc2011t13),
which differs from the Wikipedia data used for the traditional-Chinese ELMo.
## Pre-requirements
* Python >= 3.6 is **required** (with Python 3.5 you will encounter [issue 8](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/8))
* PyTorch 0.4
* other requirements from AllenNLP
## Usage
### Install the package
You need to install the package to use the embeddings. Run the following command:
```
python setup.py install
```
### Set up the `config_path`
After unzipping the model, you will find a JSON file `${lang}.model/config.json`.
Change its `"config_path"` field to the relative path of
the model configuration `cnn_50_100_512_4096_sample.json`.
For example, if your ELMo model is `zht.model/config.json` and your model configuration
is `zht.model/cnn_50_100_512_4096_sample.json`, set `"config_path"`
in `zht.model/config.json` to `cnn_50_100_512_4096_sample.json`.
If there is no `cnn_50_100_512_4096_sample.json` under `${lang}.model`,
either copy `configs/cnn_50_100_512_4096_sample.json` into `${lang}.model`,
or set `"config_path"` to `configs/cnn_50_100_512_4096_sample.json`.
See [issue 27](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/27) for more details.
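The edit can also be scripted. A minimal sketch (the temporary directory and the placeholder contents of `config.json` here are illustrative; only the `"config_path"` field matters):

```python
import json
import os
import tempfile

# Stand-in for an unzipped ${lang}.model directory (illustrative only).
model_dir = tempfile.mkdtemp()
cfg_path = os.path.join(model_dir, "config.json")
with open(cfg_path, "w", encoding="utf-8") as f:
    json.dump({"config_path": "WRONG/PATH.json"}, f)

# Point "config_path" at the model configuration, relative to config.json.
with open(cfg_path, encoding="utf-8") as f:
    cfg = json.load(f)
cfg["config_path"] = "cnn_50_100_512_4096_sample.json"
with open(cfg_path, "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=2)
```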
### Use ELMoForManyLangs in command line
Prepare your input file in the [CoNLL-U format](http://universaldependencies.org/format.html), like
```
1 Sue Sue _ _ _ _ _ _ _
2 likes like _ _ _ _ _ _ _
3 coffee coffee _ _ _ _ _ _ _
4 and and _ _ _ _ _ _ _
5 Bill Bill _ _ _ _ _ _ _
6 tea tea _ _ _ _ _ _ _
```
Fields should be separated by `'\t'`. We only use the second column; spaces (`' '`) are supported in
this field (in Vietnamese, a word can contain spaces).
Remember to tokenize your input first!
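A minimal way to produce such a file from tokenized sentences (the filename is illustrative; since only the second column is read, the remaining fields are left as `_`):

```python
sents = [["Sue", "likes", "coffee", "and", "Bill", "tea"]]

with open("input.conllu", "w", encoding="utf-8") as f:
    for sent in sents:
        for i, word in enumerate(sent, 1):
            # ID, FORM, then 8 placeholder fields, tab-separated (10 total)
            f.write("\t".join([str(i), word] + ["_"] * 8) + "\n")
        f.write("\n")  # a blank line ends the sentence
```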
When it's all set, run
```
$ python -m elmoformanylangs test \
--input_format conll \
--input /path/to/your/input \
--model /path/to/your/model \
--output_prefix /path/to/your/output \
--output_format hdf5 \
--output_layer -1
```
This dumps an HDF5-encoded `dict` to disk, where each key is the `'\t'`-separated
words of a sentence and the value is its 3-layer-averaged ELMo representation.
You can also dump the CNN-encoded word representations with `--output_layer 0`,
the first LSTM layer with `--output_layer 1`, and the second LSTM layer
with `--output_layer 2`.
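The dump can then be read back with `h5py`. A sketch of the layout (the file name, vectors, and embedding size here are stand-ins written by the script itself, not real ELMo output):

```python
import h5py
import numpy as np

# Write a toy file in the same layout: one dataset per sentence, keyed by
# the '\t'-joined words of that sentence.
key = "Sue\tlikes\tcoffee"
with h5py.File("elmo_demo.h5", "w") as f:
    f.create_dataset(key, data=np.zeros((3, 1024), dtype=np.float32))

# Read it back the way a consumer of the real dump would.
with h5py.File("elmo_demo.h5", "r") as f:
    for k in f.keys():
        words = k.split("\t")
        vecs = f[k][...]  # shape: (seq_len, embedding_size)
```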
We are actively changing the interface to bring it closer to the AllenNLP
ELMo API and make it friendlier to use programmatically.
### Use ELMoForManyLangs programmatically
Thanks to @voidism for contributing the API.
Using the `Embedder` Python object, you can use ELMo in your own code like this:
```python
from elmoformanylangs import Embedder
e = Embedder('/path/to/your/model/')
sents = [['今', '天', '天氣', '真', '好', '阿'],
         ['潮水', '退', '了', '就', '知道', '誰', '沒', '穿', '褲子']]
# the list of lists which stores the sentences,
# segmented if necessary
e.sents2elmo(sents)
# will return a list of numpy arrays
# each with the shape=(seq_len, embedding_size)
```
#### Parameters for initializing `Embedder`:
```python
e = Embedder(model_dir='/path/to/your/model/', batch_size=64)
```
- **model_dir**: the path from the repository top directory to your model directory.
- **batch_size**: the batch size used during inference; set it according to your GPU/CPU memory. (default: 64)
#### Parameters of `sents2elmo`:
```python
def sents2elmo(sents, output_layer=-1):
```
- **sents**: the list of lists which stores the sentences, segmented if necessary.
- **output_layer**: the target layer to output.
  - 0 for the CNN-encoded word representations
  - 1 for the first LSTM layer
  - 2 for the second LSTM layer
  - -1 for an average of the 3 layers (default)