transformers_without_tears-master.zip resource - CSDN Library

22 files in total

py: 9

vi: 3

en: 3

110 views, uploaded 2024-04-07 21:14:01, 9.98 MB ZIP

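Understanding the Transformer in Depth: Building It from Scratch

The Transformer, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need", was a major breakthrough in natural language processing. It replaced the strictly sequential computation of RNNs (recurrent neural networks) and LSTMs (long short-term memory networks) with self-attention, which processes all positions of a sequence in parallel and therefore trains much faster on parallel hardware. The "transformers_without_tears-master" project is intended to help readers understand how the Transformer works and to learn how it is built through practice.

The core component of the Transformer is the self-attention mechanism. Self-attention lets the model take the whole sequence into account when it processes each position, instead of relying only on neighbouring positions, which makes it effective at capturing long-range dependencies. It is built from three parts: queries (Query), keys (Key) and values (Value). The similarity between a query and each key determines the weight given to each position, and the values are then summed with those weights to form the new representation.

A Transformer is a stack of layers. Each encoder layer has two main parts: multi-head self-attention and a feed-forward network (decoder layers add attention over the encoder output). Multi-head self-attention runs several independent attention heads in parallel so that the model can capture different kinds of relationships, while the feed-forward network applies a further non-linear transformation. Each sublayer is also wrapped in a residual connection and layer normalization, which help gradients propagate and keep training stable. Because self-attention itself carries no information about order, a positional encoding is added to the input. In the original paper it is generated with sine and cosine functions so that the model can tell positions apart.

The "transformers_without_tears-master" project provides a PyTorch implementation of the Transformer together with a step-by-step guide to preprocessing, training and translation. The "a.txt" file in the archive is empty (0 B in the file listing), and the project directory "transformers_without_tears-master" contains the source code, a dataset, scripts and model configurations. By studying and running the project, readers can learn the theory behind the Transformer and also how to apply and tune it in practice, including setting hyperparameters, choosing an optimizer, defining the loss function, and training and evaluating the model. This is useful both for following recent progress in natural language processing and for building solutions to related tasks.

With self-attention and parallel computation, the Transformer opened a new chapter for deep learning in sequence modeling. The "transformers_without_tears-master" project gives developers a chance to study the Transformer closely and work with it hands-on, which helps build practical AI and NLP skills.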
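The query/key/value computation described above can be sketched in a few lines. This is a minimal, single-head illustration in plain Python (no PyTorch), not the project's code: the function names are made up for this example, and the learned projections that normally produce the queries, keys and values are assumed to be the identity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for a single head.

    queries, keys: lists of vectors of the same dimension d_k
    values: list of vectors, one per key
    Returns one output vector per query.
    """
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Turn similarities into weights that sum to 1.
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)])
    return outputs


# Assumption: identity projections, so the inputs serve as Q, K and V directly.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x, x, x)
```

Each row of `out` is a mixture of all three input vectors, which is what lets every position see the whole sequence.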


transformers_without_tears-master.zip (22 files)

transformers_without_tears-master

utils.py 2KB

controller.py 19KB

data_manager.py 7KB

ace.jpg 344KB

main.py 3KB

data

readme.md 78B

en2vi

train.vi 17.18MB

train.en 12.93MB

test.vi 180KB

dev.vi 184KB

test.en 129KB

dev.en 137KB

LICENSE 1KB

layers.py 18KB

configurations.py 1KB

readme.md 10KB

model.py 6KB

preprocessing.py 9KB

all_constants.py 480B

scripts

multi-bleu.perl 5KB

multi-bleu-detok.perl 6KB

a.txt 0B

# Ace: An implementation of Transformer in PyTorch

[Toan Q. Nguyen](http://tnq177.github.io), University of Notre Dame

This is the *re*-implementation of the paper [Transformers without Tears: Improving the Normalization of Self-Attention](https://arxiv.org/pdf/1910.05895.pdf).

While the code was initially developed to experiment with multilingual NMT, all experiments mentioned in the paper and also in this guide are meant for bilingual only. Regarding the multilingual parts of the code, I followed [XLM](https://github.com/facebookresearch/XLM) and added the following changes:

* language embedding: each language has an embedding vector which is added to the input word embeddings, similar to positional embedding
* oversampling data before BPE: first, training sentences are grouped by language, then a heuristic multinomial distribution is calculated based on the size of each language. We sample sentences from each language according to this distribution so that rarer languages are better represented and won't be broken into very short BPE segments. See [their paper](https://arxiv.org/abs/1901.07291) for more information. My own implementation is in `preprocessing.py`

If we train for bilingual only, adding language embedding and oversampling data won't make any difference (according to my early experiments). I, however, keep them in the code since they might be useful later.

This code has been tested with only Python 3.6 and PyTorch 1.4.
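The README does not spell out the heuristic distribution used for oversampling. As a rough illustration only, the sketch below assumes the exponent-smoothed form described in the XLM paper, where each language's share of the data is raised to a power `alpha < 1` and renormalized. The function name, `alpha` and the corpus sizes are hypothetical, and the actual code in `preprocessing.py` may differ.

```python
def sampling_probs(sizes, alpha=0.5):
    """Smooth per-language data shares so that small languages are sampled more often.

    sizes: dict mapping language code -> number of training sentences
    alpha: smoothing exponent (assumed value; 1.0 keeps the raw proportions)
    """
    total = sum(sizes.values())
    # Raise each language's share of the data to the power alpha...
    smoothed = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    # ...and renormalize so the probabilities sum to 1.
    norm = sum(smoothed.values())
    return {lang: s / norm for lang, s in smoothed.items()}


# Hypothetical corpus sizes: "vi" holds 10% of the sentences.
sizes = {"en": 900_000, "vi": 100_000}
probs = sampling_probs(sizes)
```

With these numbers the smaller language holds 10% of the data but is sampled 25% of the time, which is the kind of rebalancing the bullet above describes.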
Pretrained models (retrained, not the ones from the paper): [ted gl2en with warmup](https://drive.google.com/file/d/1yhzSLtAHTOjVTFPdpydrBRlm8lhx2jm8/view?usp=sharing), [iwslt15 en-vi with warmup](https://drive.google.com/file/d/1E4suCr-UDlMtjeNCdTrilYU8Szw56m2X/view?usp=sharing) and [without warmup](https://drive.google.com/file/d/1zgnOr1PmHEdt6_0q2Ebt07eOcfnJU-EX/view)

## Input and Preprocessing

Under a `data` directory, for each language pair of `src_lang` and `tgt_lang`, create a folder named `src_lang2tgt_lang` which has the following files:

* `train.src_lang`
* `train.tgt_lang`
* `dev.src_lang`
* `dev.tgt_lang`
* `test.src_lang`
* `test.tgt_lang`

The files should be tokenized. My rule of thumb for data preprocessing:

* tokenize data
* filter out sentences longer than 200-250 words
* learn BPE
* apply BPE
* don't filter again

Transformer is known to not generalize well to sentences longer than what it's seen (see [this](https://arxiv.org/abs/1804.00247)), so we need long sentences during training. We don't have to worry about OOM because we always batch by number of tokens. Even a really long sentence of 250 words won't have more than 2048/4096 BPE tokens.

After that, install [fastBPE](https://github.com/glample/fastBPE).
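To illustrate why batching by number of tokens keeps memory bounded even with long sentences, here is a minimal sketch in plain Python. It is not the code from `data_manager.py`: the function name and the greedy, length-sorted grouping are assumptions made for this example, and it counts unpadded tokens only.

```python
def batch_by_tokens(pairs, max_tokens):
    """Group (src_tokens, tgt_tokens) pairs so each batch holds at most
    max_tokens source+target tokens (a single longer pair gets its own batch)."""
    # Sort by total length so sentences of similar length end up together.
    ordered = sorted(pairs, key=lambda p: len(p[0]) + len(p[1]))
    batches, current, current_tokens = [], [], 0
    for src, tgt in ordered:
        n = len(src) + len(tgt)
        # Start a new batch when adding this pair would exceed the budget.
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append((src, tgt))
        current_tokens += n
    if current:
        batches.append(current)
    return batches


# Toy data: four sentence pairs with 6, 2, 10 and 4 tokens in total.
pairs = [
    (["a"] * 3, ["b"] * 3),
    (["a"], ["b"]),
    (["a"] * 5, ["b"] * 5),
    (["a"] * 2, ["b"] * 2),
]
batches = batch_by_tokens(pairs, max_tokens=10)
```

Short pairs share a batch while long pairs end up alone, so the number of sentences per batch shrinks as sentences get longer and the token count per batch stays under the budget.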
With fastBPE installed, run:

`python3 preprocessing.py --data-dir data --num-ops number_of_bpe_ops --pairs src_lang2tgt_lang --fast path_to_fastbpe_binary`

This will:

* sample sentences from `train.{src_lang, tgt_lang}` to a joint text file
* learn BPE from that
* BPE-encode the rest of the files
* create vocabularies
* convert data into ids and save as `.npy` files

For example, if we're training an English-Vietnamese model (`en2vi`) using 8000 BPE operations, then the resultant directory looks like this:

```
data
├── en2vi
│   ├── dev.en
│   ├── dev.en.bpe
│   ├── dev.en.npy
│   ├── dev.vi
│   ├── dev.vi.bpe
│   ├── dev.vi.npy
│   ├── test.en
│   ├── test.en.bpe
│   ├── test.vi
│   ├── test.vi.bpe
│   ├── train.en
│   ├── train.en.bpe
│   ├── train.en.npy
│   ├── train.vi
│   ├── train.vi.bpe
│   └── train.vi.npy
├── joint_all.txt
├── joint.bpe
├── lang.vocab
├── mask.en.npy
├── mask.vi.npy
├── subjoint_en.txt
├── subjoint_en.txt.8000
├── subjoint_en.txt.8000.vocab
├── subjoint_vi.txt
├── subjoint_vi.txt.8000
├── subjoint_vi.txt.8000.vocab
└── vocab.joint
```

## Usage

To train a new model:

* write a new configuration function in `configurations.py`
* run `python3 main.py --mode train --data-dir ./data --dump-dir ./dump --pairs src_lang2tgt_lang --config config_name`

Note that I separate the two configs:

* hyperparameters/training options: in `configurations.py`
* what pairs are we training on, are we training or translating...: just see `main.py`

Training is logged in `dump/DEBUG.log`. During training, the model is validated on the dev set, and the best checkpoint is saved to `dump/model-SCORE.pth` (also `dump/src_lang2tgt_lang-SCORE.pth`, they are the same). All best/beam translations, final training stats (train/dev perplexities)... are stored in `dump` as well.
To translate using a checkpoint, run:

`python3 main.py --mode translate --data-dir ./data --dump-dir ./dump --pairs src_lang2tgt_lang --files-langs data/src_lang2tgt_lang/temp,src_lang,tgt_lang --config src_lang2tgt_lang --model-file dump/src_lang2tgt_lang-SCORE.pth`

## Options

Many options in `configurations.py` are pretty important:

* ``use_bias``: if set to False, all linear layers won't use bias. Defaults to True, which uses bias.
* ``fix_norm``: fix the word embedding norm to 1 by dividing each word embedding vector by its l2 norm ([Improving Lexical Choice in Neural Machine Translation](https://aclweb.org/anthology/N18-1031))
* ``scnorm``: the ScaleNorm in our paper. This replaces Layer Normalization with a scaled l2-normalization layer. It works by first normalizing the input to norm 1 (dividing the vector by its l2 norm) and then scaling it up by a single, learnable parameter. See ``ScaleNorm`` in ``layers.py``
* ``mask_logit``: if set to True, for each target language, we set the logits of types that are not in its vocabulary to -inf (so after softmax, those probs become 0). The idea is, say src and tgt each have 8000 types in their vocabs, but only 1000 are shared, then we should not predict the other 7000 types in the source.
* ``pre_act``: if True, do PreNorm (normalization->sublayer->residual-add), else do PostNorm (sublayer->residual-add->normalization). See the paper for more discussion and related works.
* ``clip_grad``: gradient clipping value. I find 1.0 works well and stabilizes training as well.
* ``warmup_steps``: number of warmup steps if we do warmup
* ``lr_scheduler``: if ``ORG_WU`` (see `all_constants.py`), we follow the warmup-cooldown schedule in the [original paper](https://arxiv.org/abs/1706.03762). If ``NO_WU``, we use a constant learning rate ``lr`` which is then decayed whenever development BLEU has not improved over ``patience`` evaluations. If ``UPFLAT_WU``, we do warmup, then stay at the peak learning rate and decay like ``NO_WU``.
* ``lr_scale``: multiply learning rate by this value
* ``lr_decay``: decay factor (new_lr <-- old_lr * lr_decay)
* ``stop_lr``: stop training when learning rate reaches this value
* ``label_smoothing``: defaults to 0.1, as in the original paper
* ``batch_size``: number of src+tgt tokens in a batch
* ``epoch_size``: number of iterations we consider one epoch
* ``max_epochs``: maximum number of epochs we train for
* ``dropout``: sublayer output's dropout rate
* ``{att,ff,word}_dropout``: dropout rate for attention layer, feedforward and word-dropout. For word-dropout, we replace with UNK instead of zero-ing embeddings. I find word-dropout useful for training low-resource, bilingual models.
* ``beam_{size, alpha}``: beam size and length normalization using [Wu et al.'s magic formula](https://arxiv.org/abs/1609.08144)

## Some other notes

Because this is my *re*-implementation from memory, there are many pieces of information I forget. I just want to clarify the following:

* The IWSLT English-Vietnamese dataset is from [paper](http://nlp.stanford.edu/pubs/luong-manning-iwslt15.pdf), [data](https://nlp.stanford.edu/projects/nmt/). The other IWSLT datasets are from [paper](https://www.aclweb.org/anthology/N18-2084/), [data](http
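Returning to the ``scnorm`` option under Options: the description (normalize the input to l2 norm 1, then scale by a single learnable parameter) can be sketched as follows. This is a plain-Python illustration, not the ``ScaleNorm`` class from ``layers.py``. The `eps` guard and the `sqrt(dim)` starting value for the scale are assumptions made here, and in a real PyTorch layer the scale would be a trainable parameter rather than a float.

```python
import math


class ScaleNorm:
    """Scaled l2 normalization: g * x / ||x||, with one scalar g for the whole layer."""

    def __init__(self, dim, eps=1e-5):
        # Single scale parameter; sqrt(dim) is an assumed initial value.
        self.g = math.sqrt(dim)
        # Assumed small constant to avoid dividing by zero for an all-zero input.
        self.eps = eps

    def __call__(self, x):
        # l2 norm of the input vector.
        norm = math.sqrt(sum(v * v for v in x))
        # Normalize to unit length, then scale up by g.
        scale = self.g / max(norm, self.eps)
        return [v * scale for v in x]


norm = ScaleNorm(dim=4)
y = norm([3.0, 0.0, 4.0, 0.0])  # input has l2 norm 5, output has l2 norm g = 2
```

Unlike Layer Normalization, which has a gain and bias per feature, this layer has only the single scalar `g`.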
