Subword Neural Machine Translation
==================================
This repository contains preprocessing scripts to segment text into subword
units. The primary purpose is to facilitate the reproduction of our experiments
on Neural Machine Translation with subword units (see below for reference).
INSTALLATION
------------
install via pip (from PyPI):
pip install subword-nmt
install via pip (from Github):
pip install https://github.com/rsennrich/subword-nmt/archive/master.zip
alternatively, clone this repository; the scripts are executable stand-alone.
USAGE INSTRUCTIONS
------------------
Check the individual files for usage instructions.
To apply byte pair encoding to word segmentation, invoke these commands:
subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}
subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}
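For instance, a filled-in run might look as follows (the file names and the number of merge operations are illustrative placeholders, not defaults):

```
# learn 10000 merge operations from a tokenized training corpus
subword-nmt learn-bpe -s 10000 < train.tok.en > codes.en

# segment the test set with the learned codes
subword-nmt apply-bpe -c codes.en < test.tok.en > test.bpe.en
```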
To segment rare words into character n-grams, do the following:
subword-nmt get-vocab --train_file {train_file} --vocab_file {vocab_file}
subword-nmt segment-char-ngrams --vocab {vocab_file} -n {order} --shortlist {size} < {test_file} > {out_file}
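A filled-in sketch (the vocabulary file name, n-gram order, and shortlist size are illustrative choices):

```
subword-nmt get-vocab --train_file train.tok.en --vocab_file vocab.en
subword-nmt segment-char-ngrams --vocab vocab.en -n 3 --shortlist 50000 < test.tok.en > test.cn.en
```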
The original segmentation can be restored with a simple replacement:
sed -r 's/(@@ )|(@@ ?$)//g'
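For example, piping a segmented sentence through this replacement restores the original text:

```
echo "I am fl@@ y@@ ing to S@@ w@@ it@@ z@@ er@@ l@@ and ." | sed -r 's/(@@ )|(@@ ?$)//g'
I am flying to Switzerland .
```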
If you cloned the repository and did not install a package, you can also run the individual commands as scripts:
./subword_nmt/learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
BEST PRACTICE ADVICE FOR BYTE PAIR ENCODING IN NMT
--------------------------------------------------
We found that for languages that share an alphabet, learning BPE on the
concatenation of the (two or more) involved languages increases the consistency
of segmentation, and reduces the problem of inserting/deleting characters when
copying/transliterating names.
However, this introduces undesirable edge cases in that a word may be segmented
in a way that has only been observed in the other language, and is thus unknown
at test time. To prevent this, `apply_bpe.py` accepts a `--vocabulary` and a
`--vocabulary-threshold` option so that the script will only produce symbols
that also appear in the vocabulary with at least the given frequency.
To use this functionality, we recommend the following recipe (assuming L1 and L2
are the two languages):
Learn byte pair encoding on the concatenation of the training text, and get the resulting vocabulary for each language:
cat {train_file}.L1 {train_file}.L2 | subword-nmt learn-bpe -s {num_operations} -o {codes_file}
subword-nmt apply-bpe -c {codes_file} < {train_file}.L1 | subword-nmt get-vocab > {vocab_file}.L1
subword-nmt apply-bpe -c {codes_file} < {train_file}.L2 | subword-nmt get-vocab > {vocab_file}.L2
more conveniently, you can do the same with this command:
subword-nmt learn-joint-bpe-and-vocab --input {train_file}.L1 {train_file}.L2 -s {num_operations} -o {codes_file} --write-vocabulary {vocab_file}.L1 {vocab_file}.L2
re-apply byte pair encoding with vocabulary filter:
subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {train_file}.L1 > {train_file}.BPE.L1
subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L2 --vocabulary-threshold 50 < {train_file}.L2 > {train_file}.BPE.L2
as a last step, extract the vocabulary to be used by the neural network. Example with Nematus:
nematus/data/build_dictionary.py {train_file}.BPE.L1 {train_file}.BPE.L2
[you may want to take the union of all vocabularies to support multilingual systems]
for test/dev data, re-use the same options for consistency:
subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {test_file}.L1 > {test_file}.BPE.L1
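Putting the recipe together, here is a hedged end-to-end sketch for a German-English pair (all file names, the number of operations, and the threshold are placeholders to adapt to your data):

```
# learn joint BPE and per-language vocabularies
subword-nmt learn-joint-bpe-and-vocab --input train.de train.en -s 32000 \
    -o codes.bpe --write-vocabulary vocab.de vocab.en

# apply BPE with vocabulary filter to the training data
subword-nmt apply-bpe -c codes.bpe --vocabulary vocab.de --vocabulary-threshold 50 < train.de > train.bpe.de
subword-nmt apply-bpe -c codes.bpe --vocabulary vocab.en --vocabulary-threshold 50 < train.en > train.bpe.en

# re-use the same codes and vocabularies for dev/test data
subword-nmt apply-bpe -c codes.bpe --vocabulary vocab.de --vocabulary-threshold 50 < dev.de > dev.bpe.de
subword-nmt apply-bpe -c codes.bpe --vocabulary vocab.en --vocabulary-threshold 50 < dev.en > dev.bpe.en
```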
ADVANCED FEATURES
-----------------
On top of the basic BPE implementation, this repository supports:
- BPE dropout (Provilkov, Emelianenko and Voita, 2019): https://arxiv.org/abs/1910.13267
use the argument `--dropout 0.1` for `subword-nmt apply-bpe` to randomly drop out possible merges.
Doing this on the training corpus can improve quality of the final system; at test time, use BPE without dropout.
In order to obtain reproducible results, argument `--seed` can be used to set the random seed.
**Note:** In the original paper, the authors applied BPE-dropout to each new batch separately. To approximate this behaviour and obtain multiple segmentations of the same sentence, you can copy the training corpus several times before applying BPE-dropout.
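A hedged example invocation (file names, dropout rate, and seed are placeholders):

```
# training data: segment stochastically with BPE-dropout
subword-nmt apply-bpe -c codes.bpe --dropout 0.1 --seed 42 < train.en > train.bpe-dropout.en

# test data: segment deterministically, without dropout
subword-nmt apply-bpe -c codes.bpe < test.en > test.bpe.en
```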
- support for glossaries:
use the argument `--glossaries` for `subword-nmt apply-bpe` to provide a list of subwords and/or regular expressions
that should always be passed to the output without subword segmentation
```
echo "I am flying to <country>Switzerland</country> at noon ." | subword-nmt apply-bpe --codes subword_nmt/tests/data/bpe.ref
I am fl@@ y@@ ing to <@@ coun@@ tr@@ y@@ >@@ S@@ w@@ it@@ z@@ er@@ l@@ and@@ <@@ /@@ coun@@ tr@@ y@@ > at no@@ on .
echo "I am flying to <country>Switzerland</country> at noon ." | subword-nmt apply-bpe --codes subword_nmt/tests/data/bpe.ref --glossaries "<country>\w*</country>" "fly"
I am fly@@ ing to <country>Switzerland</country> at no@@ on .
```
PUBLICATIONS
------------
The segmentation methods are described in:
Rico Sennrich, Barry Haddow and Alexandra Birch (2016):
Neural Machine Translation of Rare Words with Subword Units
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
HOW IMPLEMENTATION DIFFERS FROM Sennrich et al. (2016)
------------------------------------------------------
This repository implements the subword segmentation as described in Sennrich et al. (2016),
but since version 0.2, there is one core difference related to end-of-word tokens.
In Sennrich et al. (2016), the end-of-word token `</w>` is initially represented as a separate token, which can be merged with other subwords over time:
```
u n d </w>
f u n d </w>
```
Since version 0.2, the end-of-word token is initially concatenated with the word-final character:
```
u n d</w>
f u n d</w>
```
The new representation ensures that when BPE codes are learned from the above examples and then applied to new text, it is clear that a subword unit `und` is unambiguously word-final, and `un` is unambiguously word-internal, preventing the production of up to two different subword units from each BPE merge operation.
`apply_bpe.py` is backward-compatible and continues to accept old-style BPE files. New-style BPE files are identified by having the following first line: `#version: 0.2`
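For illustration, a new-style codes file learned from the toy corpus above might begin like this (the exact merge operations depend on symbol frequencies; this sketch only indicates the file layout, with one space-separated symbol pair per line):

```
#version: 0.2
u n
un d</w>
f un
```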
ACKNOWLEDGMENTS
---------------
This project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland, and from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).