Pre-trained ELMo Representations for Many Languages
===================================================
We release our ELMo representations trained on many languages,
which helped us win the [CoNLL 2018 shared task on Universal Dependencies Parsing](http://universaldependencies.org/conll18/results.html)
as measured by LAS.
## Technique Details
We use the same hyperparameter settings as [Peters et al. (2018)](https://arxiv.org/abs/1802.05365) for the biLM
and the character CNN.
We train the model parameters
on 20 million words randomly
sampled from the raw text released by the shared task (Wikipedia dump + Common Crawl) for each language.
Our code is largely based on [AllenNLP](https://allennlp.org/), with the following changes:
* We support unicode characters;
* We use the *sampled softmax* technique
to make training with a large vocabulary feasible ([Jean et al., 2015](https://arxiv.org/abs/1412.2007)).
However, we use a window of words surrounding the target word
as negative samples, which showed better performance in our preliminary experiments.
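As a rough illustration of the window-based negative sampling, here is a simplified numpy sketch (not the actual training code; all names and dimensions are made up): the softmax is normalized only over the target word plus its context-window words instead of the full vocabulary.

```python
import numpy as np

def window_sampled_softmax_loss(hidden, output_emb, sent_ids, pos, window=2):
    """Negative log-likelihood of the word at `pos`, normalizing only over
    the target plus the words in a +/- `window` context (the negatives)."""
    target = sent_ids[pos]
    lo, hi = max(0, pos - window), min(len(sent_ids), pos + window + 1)
    negatives = [sent_ids[j] for j in range(lo, hi)
                 if j != pos and sent_ids[j] != target]
    cand = np.array([target] + negatives)   # target word first
    logits = output_emb[cand] @ hidden      # score only the candidates
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                # NLL of the target word

# Toy example: 10-word vocabulary, 4-dimensional states.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 4))
loss = window_sampled_softmax_loss(emb[3], emb, [1, 4, 3, 7, 2], pos=2)
```

Restricting the normalization to a handful of nearby words keeps the per-step cost independent of the vocabulary size.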
The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.
## Downloads
| | | | |
|---|---|---|---|
| [Arabic](http://vectors.nlpl.eu/repository/11/136.zip) | [Bulgarian](http://vectors.nlpl.eu/repository/11/137.zip) | [Catalan](http://vectors.nlpl.eu/repository/11/138.zip) | [Czech](http://vectors.nlpl.eu/repository/11/139.zip) |
| [Old Church Slavonic](http://vectors.nlpl.eu/repository/11/140.zip) | [Danish](http://vectors.nlpl.eu/repository/11/141.zip) | [German](http://vectors.nlpl.eu/repository/11/142.zip) | [Greek](http://vectors.nlpl.eu/repository/11/143.zip) |
| [English](http://vectors.nlpl.eu/repository/11/144.zip) | [Spanish](http://vectors.nlpl.eu/repository/11/145.zip) | [Estonian](http://vectors.nlpl.eu/repository/11/146.zip) | [Basque](http://vectors.nlpl.eu/repository/11/147.zip) |
| [Persian](http://vectors.nlpl.eu/repository/11/148.zip) | [Finnish](http://vectors.nlpl.eu/repository/11/149.zip) | [French](http://vectors.nlpl.eu/repository/11/150.zip) | [Irish](http://vectors.nlpl.eu/repository/11/151.zip) |
| [Galician](http://vectors.nlpl.eu/repository/11/152.zip) | [Ancient Greek](http://vectors.nlpl.eu/repository/11/153.zip) | [Hebrew](http://vectors.nlpl.eu/repository/11/154.zip) | [Hindi](http://vectors.nlpl.eu/repository/11/155.zip) |
| [Croatian](http://vectors.nlpl.eu/repository/11/156.zip) | [Hungarian](http://vectors.nlpl.eu/repository/11/157.zip) | [Indonesian](http://vectors.nlpl.eu/repository/11/158.zip) | [Italian](http://vectors.nlpl.eu/repository/11/159.zip) |
| [Japanese](http://vectors.nlpl.eu/repository/11/160.zip) | [Korean](http://vectors.nlpl.eu/repository/11/161.zip) | [Latin](http://vectors.nlpl.eu/repository/11/162.zip) | [Latvian](http://vectors.nlpl.eu/repository/11/163.zip) |
| [Norwegian Bokmål](http://vectors.nlpl.eu/repository/11/165.zip) | [Dutch](http://vectors.nlpl.eu/repository/11/164.zip) | [Norwegian Nynorsk](http://vectors.nlpl.eu/repository/11/166.zip) | [Polish](http://vectors.nlpl.eu/repository/11/167.zip) |
| [Portuguese](http://vectors.nlpl.eu/repository/11/168.zip) | [Romanian](http://vectors.nlpl.eu/repository/11/169.zip) | [Russian](http://vectors.nlpl.eu/repository/11/170.zip) | [Slovak](http://vectors.nlpl.eu/repository/11/171.zip) |
| [Slovene](http://vectors.nlpl.eu/repository/11/172.zip) | [Swedish](http://vectors.nlpl.eu/repository/11/173.zip) | [Turkish](http://vectors.nlpl.eu/repository/11/174.zip) | [Uyghur](http://vectors.nlpl.eu/repository/11/175.zip) |
| [Ukrainian](http://vectors.nlpl.eu/repository/11/176.zip) | [Urdu](http://vectors.nlpl.eu/repository/11/177.zip) | [Vietnamese](http://vectors.nlpl.eu/repository/11/178.zip) | [Chinese](http://vectors.nlpl.eu/repository/11/179.zip) |
The models are hosted on the [NLPL Vectors Repository](http://wiki.nlpl.eu/index.php/Vectors/home).
**ELMo for Simplified Chinese**
We also provide a [simplified-Chinese ELMo](https://pan.baidu.com/s/1RNKnj6hgL-2orQ7f38CauA?errno=0&errmsg=Auth%20Login%20Sucess&&bduss=&ssnerror=0&traceid=) (see [issue 37](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/37)).
It was trained on the Xinhua portion of [Chinese Gigaword Fifth Edition](https://catalog.ldc.upenn.edu/ldc2011t13),
which differs from the Wikipedia data used for the traditional-Chinese ELMo.
## Pre-requirements
* Python >= 3.6 is **required** (with Python 3.5 you will encounter [issue 8](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/8))
* PyTorch 0.4
* other requirements from AllenNLP
## Usage
### Install the package
You need to install the package to use the embeddings. Run the following command:
```
python setup.py install
```
### Set up the `config_path`
After unzipping the model, you will find a JSON file `${lang}.model/config.json`.
Change its `"config_path"` field to the relative path of
the model configuration `cnn_50_100_512_4096_sample.json`.
For example, if your ELMo model is `zht.model/config.json` and your model configuration
is `zht.model/cnn_50_100_512_4096_sample.json`, set `"config_path"`
in `zht.model/config.json` to `cnn_50_100_512_4096_sample.json`.
If there is no `cnn_50_100_512_4096_sample.json` under `${lang}.model`,
either copy `configs/cnn_50_100_512_4096_sample.json` into `${lang}.model`,
or set `"config_path"` to `configs/cnn_50_100_512_4096_sample.json`.
See [issue 27](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/27) for more details.
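The edit can also be scripted. A minimal sketch (the temporary directory and the placeholder contents of `config.json` here are illustrative; only the `"config_path"` field matters):

```python
import json
import os
import tempfile

# Stand-in for an unzipped ${lang}.model directory (illustrative only).
model_dir = tempfile.mkdtemp()
cfg_path = os.path.join(model_dir, "config.json")
with open(cfg_path, "w", encoding="utf-8") as f:
    json.dump({"config_path": "WRONG/PATH.json"}, f)

# Point "config_path" at the model configuration, relative to config.json.
with open(cfg_path, encoding="utf-8") as f:
    cfg = json.load(f)
cfg["config_path"] = "cnn_50_100_512_4096_sample.json"
with open(cfg_path, "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=2)
```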
### Use ELMoForManyLangs in command line
Prepare your input file in the [CoNLL-U format](http://universaldependencies.org/format.html), like
```
1 Sue Sue _ _ _ _ _ _ _
2 likes like _ _ _ _ _ _ _
3 coffee coffee _ _ _ _ _ _ _
4 and and _ _ _ _ _ _ _
5 Bill Bill _ _ _ _ _ _ _
6 tea tea _ _ _ _ _ _ _
```
Fields should be separated by `'\t'`. We only use the second column; spaces (`' '`) are supported in
this field (in Vietnamese, a word can contain spaces).
Remember to tokenize your input first!
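A minimal way to produce such a file from tokenized sentences (the filename is illustrative; since only the second column is read, the remaining fields are left as `_`):

```python
sents = [["Sue", "likes", "coffee", "and", "Bill", "tea"]]

with open("input.conllu", "w", encoding="utf-8") as f:
    for sent in sents:
        for i, word in enumerate(sent, 1):
            # ID, FORM, then 8 placeholder fields, tab-separated (10 total)
            f.write("\t".join([str(i), word] + ["_"] * 8) + "\n")
        f.write("\n")  # a blank line ends the sentence
```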
When it's all set, run
```
$ python -m elmoformanylangs test \
--input_format conll \
--input /path/to/your/input \
--model /path/to/your/model \
--output_prefix /path/to/your/output \
--output_format hdf5 \
--output_layer -1
```
This dumps an HDF5-encoded `dict` to disk, where each key is the `'\t'`-separated
words of a sentence and the value is its 3-layer-averaged ELMo representation.
You can also dump the CNN-encoded word representations with `--output_layer 0`,
the first LSTM layer with `--output_layer 1`, and the second LSTM layer
with `--output_layer 2`.
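The dump can then be read back with `h5py`. A sketch of the layout (the file name, vectors, and embedding size here are stand-ins written by the script itself, not real ELMo output):

```python
import h5py
import numpy as np

# Write a toy file in the same layout: one dataset per sentence, keyed by
# the '\t'-joined words of that sentence.
key = "Sue\tlikes\tcoffee"
with h5py.File("elmo_demo.h5", "w") as f:
    f.create_dataset(key, data=np.zeros((3, 1024), dtype=np.float32))

# Read it back the way a consumer of the real dump would.
with h5py.File("elmo_demo.h5", "r") as f:
    for k in f.keys():
        words = k.split("\t")
        vecs = f[k][...]  # shape: (seq_len, embedding_size)
```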
We are actively changing the interface to bring it closer to the AllenNLP
ELMo API and make it friendlier to use programmatically.
### Use ELMoForManyLangs programmatically
Thanks to @voidism for contributing the API.
Using the `Embedder` Python object, you can use ELMo in your own code like this:
```python
from elmoformanylangs import Embedder
e = Embedder('/path/to/your/model/')
sents = [['今', '天', '天氣', '真', '好', '阿'],
         ['潮水', '退', '了', '就', '知道', '誰', '沒', '穿', '褲子']]
# the list of lists which stores the sentences,
# segmented if necessary
e.sents2elmo(sents)
# will return a list of numpy arrays
# each with the shape=(seq_len, embedding_size)
```
#### Parameters for initializing `Embedder`:
```python
e = Embedder(model_dir='/path/to/your/model/', batch_size=64)
```
- **model_dir**: the path from the repository top directory to your model directory.
- **batch_size**: the batch size used during inference; set it according to your GPU/CPU memory. (default: 64)
#### Parameters of `sents2elmo`:
```python
def sents2elmo(sents, output_layer=-1):
```
- **sents**: the list of lists which stores the sentences, segmented if necessary.
- **output_layer**: the target layer to output.
  - 0 for the CNN-encoded word representations
  - 1 for the first LSTM layer
  - 2 for the second LSTM layer
  - -1 for an average of the 3 layers (default)