LLM预备知识-attentionisallyouneed资源-CSDN文库

172 浏览量 2023-12-11 14:12:41 上传评论收藏 2.06MB PDF 举报

Transformer 模型架构的设计理念和优势本文主要介绍 Transformer 模型架构的设计理念和优势，Transformer 模型是基于自注意力机制的序列转换模型，摒弃了传统的循环神经网络（RNN）和卷积神经网络（CNN），并证明了其在机器翻译任务上的优越性。一、Transformer 模型架构的设计理念传统的序列转换模型基于循环神经网络（RNN）或卷积神经网络（CNN），但是这些模型存在一些缺陷，如计算复杂、难以并行计算等。Transformer 模型架构的设计理念是基于自注意力机制，摒弃了循环和卷积操作，使用自注意力机制来模拟序列之间的依赖关系。 Transformer 模型架构主要由 encoder 和 decoder 两个部分组成，encoder 负责将输入序列转换为连续的表示，decoder 负责将连续的表示转换回输出序列。自注意力机制是 Transformer 模型架构的核心组件，负责模拟序列之间的依赖关系。二、Transformer 模型架构的优势 Transformer 模型架构具有多个优势，如： 1. 并行计算能力强，在训练和推理阶段可以并行计算，提高计算效率。 2. 计算复杂度低，摒弃了循环和卷积操作，减少了计算复杂度。 3. 可以处理长序列，自注意力机制可以模拟长序列之间的依赖关系。 4. 可以并行处理多个任务，Transformer 模型架构可以并行处理多个机器翻译任务，提高计算效率。三、Transformer 模型架构在机器翻译任务上的应用 Transformer 模型架构在机器翻译任务上的应用效果非常好，在 WMT 2014 英文到德文翻译任务上，Transformer 模型架构的 BLEU 分数达到 28.4，超越了之前的最好结果。在 WMT 2014 英文到法文翻译任务上，Transformer 模型架构的 BLEU 分数达到 41.8，创造了新的单模型状态的艺术记录。四、Transformer 模型架构的泛化能力 Transformer 模型架构不仅可以应用于机器翻译任务，也可以应用于其他自然语言处理任务，如英语成分分析任务。在英语成分分析任务上，Transformer 模型架构可以达到很好的结果，证明了其泛化能力。 Transformer 模型架构是一种基于自注意力机制的序列转换模型，具有多个优势，如并行计算能力强、计算复杂度低、可以处理长序列等，可以应用于多种自然语言处理任务。

资源推荐

资源详情

资源评论

Attention Is All You Need

Ashish Vaswani

∗

Google Brain

avaswani@google.com

Noam Shazeer

∗

Google Brain

noam@google.com

Niki Parmar

∗

Google Research

nikip@google.com

Jakob Uszkoreit

∗

Google Research

usz@google.com

Llion Jones

∗

Google Research

llion@google.com

Aidan N. Gomez

∗ †

University of Toronto

aidan@cs.toronto.edu

Łukasz Kaiser

∗

Google Brain

lukaszkaiser@google.com

Illia Polosukhin

∗ ‡

illia.polosukhin@gmail.com

Abstract

The dominant sequence transduction models are based on complex recurrent or

convolutional neural networks that include an encoder and a decoder. The best

performing models also connect the encoder and decoder through an attention

mechanism. We propose a new simple network architecture, the Transformer,

based solely on attention mechanisms, dispensing with recurrence and convolutions

entirely. Experiments on two machine translation tasks show these models to

be superior in quality while being more parallelizable and requiring signiﬁcantly

less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-

to-German translation task, improving over the existing best results, including

ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,

our model establishes a new single-model state-of-the-art BLEU score of 41.8 after

training for 3.5 days on eight GPUs, a small fraction of the training costs of the

best models from the literature. We show that the Transformer generalizes well to

other tasks by applying it successfully to English constituency parsing both with

large and limited training data.

1 Introduction

Recurrent neural networks, long short-term memory [

] and gated recurrent [

] neural networks

in particular, have been ﬁrmly established as state of the art approaches in sequence modeling and

∗

Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started

the effort to evaluate this idea. Ashish, with Illia, designed and implemented the ﬁrst Transformer models and

has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head

attention and the parameter-free position representation and became the other person involved in nearly every

detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and

tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and

efﬁcient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and

implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating

our research.

†

Work performed while at Google Brain.

‡

Work performed while at Google Research.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

arXiv:1706.03762v5 [cs.CL] 6 Dec 2017

Transformer是第一种完全依赖attention的序列转

换模型。使用与multi-headed self-attention结合

的encode-decode架构替换迭代层。

transduction problems such as language modeling and machine translation [

]. Numerous

efforts have since continued to push the boundaries of recurrent language models and encoder-decoder

architectures [38, 24, 15].

Recurrent models typically factor computation along the symbol positions of the input and output

sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden

states

, as a function of the previous hidden state

t−1

and the input for position

. This inherently

sequential nature precludes parallelization within training examples, which becomes critical at longer

sequence lengths, as memory constraints limit batching across examples. Recent work has achieved

signiﬁcant improvements in computational efﬁciency through factorization tricks [

] and conditional

computation [

], while also improving model performance in case of the latter. The fundamental

constraint of sequential computation, however, remains.

Attention mechanisms have become an integral part of compelling sequence modeling and transduc-

tion models in various tasks, allowing modeling of dependencies without regard to their distance in

the input or output sequences [

]. In all but a few cases [

], however, such attention mechanisms

are used in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead

relying entirely on an attention mechanism to draw global dependencies between input and output.

The Transformer allows for signiﬁcantly more parallelization and can reach a new state of the art in

translation quality after being trained for as little as twelve hours on eight P100 GPUs.

2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU

[

], ByteNet [

] and ConvS2S [

], all of which use convolutional neural networks as basic building

block, computing hidden representations in parallel for all input and output positions. In these models,

the number of operations required to relate signals from two arbitrary input or output positions grows

in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes

it more difﬁcult to learn dependencies between distant positions [

]. In the Transformer this is

reduced to a constant number of operations, albeit at the cost of reduced effective resolution due

to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as

described in section 3.2.

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions

of a single sequence in order to compute a representation of the sequence. Self-attention has been

used successfully in a variety of tasks including reading comprehension, abstractive summarization,

textual entailment and learning task-independent sentence representations [4, 27, 28, 22].

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-

aligned recurrence and have been shown to perform well on simple-language question answering and

language modeling tasks [34].

To the best of our knowledge, however, the Transformer is the ﬁrst transduction model relying

entirely on self-attention to compute representations of its input and output without using sequence-

aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate

self-attention and discuss its advantages over models such as [17, 18] and [9].

3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [

Here, the encoder maps an input sequence of symbol representations

, ..., x

)

to a sequence

of continuous representations

z = (z

, ..., z

)

. Given

, the decoder then generates an output

sequence

, ..., y

)

of symbols one element at a time. At each step the model is auto-regressive

[10], consuming the previously generated symbols as additional input when generating the next.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully

connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,

respectively.

Transformer，一个避免迭代的模型，它完全依赖一种

attention机制来描述输入和输出之间的依赖关系。

自回归

Figure 1: The Transformer - model architecture.

3.1 Encoder and Decoder Stacks

Encoder:

The encoder is composed of a stack of

N = 6

identical layers. Each layer has two

sub-layers. The ﬁrst is a multi-head self-attention mechanism, and the second is a simple, position-

wise fully connected feed-forward network. We employ a residual connection [

] around each of

the two sub-layers, followed by layer normalization [

]. That is, the output of each sub-layer is

LayerNorm(x + Sublayer(x))

, where

Sublayer(x)

is the function implemented by the sub-layer

itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding

layers, produce outputs of dimension d

model

= 512.

Decoder:

The decoder is also composed of a stack of

N = 6

identical layers. In addition to the two

sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head

attention over the output of the encoder stack. Similar to the encoder, we employ residual connections

around each of the sub-layers, followed by layer normalization. We also modify the self-attention

sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This

masking, combined with fact that the output embeddings are offset by one position, ensures that the

predictions for position i can depend only on the known outputs at positions less than i.

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output,

where the query, keys, values, and output are all vectors. The output is computed as a weighted sum

of the values, where the weight assigned to each value is computed by a compatibility function of the

query with the corresponding key.

encode包括N=6个完全相同的layer，每个layer有两个sub-layer：即multi-head

self-attention、fully connected feed-forward network。

对这两个sub-layer都执行残差连接，随后都有一个normalization layer。

残差连接

skip

connection

decode与encode相比，插入了第三个sub-layer，以对

encode的输出执行multi-head attention

这个masking，加上输出嵌入

被一个位置偏移的事实，确

保了位置i的预测只能依赖于

位置i以前的已知输出。

剩余14页未读，继续阅读

评论收藏

内容反馈

lucky_chaichai

粉丝: 7189
资源: 5

LLM预备知识-attention is all you need

最新资源

LLM预备知识-attention is all you need

The Document is All You Need！一站式 LLM底层技术原理入门指南.pdf

tensorrt-llm-0.8.0-cp310-cp310-win-amd64.whl

llm-medical-data用于大模型微调训练的医疗数据集_llm-medical-data.zip

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena阅读笔记

LangChain-for-LLM-Application-Development-main.zip

LLM-RAG-WEB 大模型本地知识库召回

LLM大语言模型可视化三维演示，LLM-viz_LLM-viz-cn中文翻译.zip

OWASP-Top-10-for-LLM-Applications-v1_1_Chinese.pdf

大模型驱动的虚拟主播_LLM-virtual-bilibilier.zip

Building LLM Powered Applications-2024.5.pdf

大模型辅助的医学免疫学情景化考题自动出题系统LLM-question-setting-system-master.zip

byzer-llm-3.3-2.12-0.1.0-SNAPSHOT.jar

仅使用一个python文件，创建实时语音识别+大模型回复的GUI程序_Realtime-STT2LLM-in-1-py.zip

使用各种大语言模型配合沉浸式翻译插件Immersive_Translate完成文档翻译_llm-translate-x.zip

LLM群聊框架-同时与多个LLM聊天. 大模型群聊框架：同时与多个大语言模型聊天。_OpenA.zip

算法部署-使用TensorRT-LLM部署大模型-附详细优化+分析流程教程-优质大模型部署项目实战.zip

LLM Concepts Guide - 谷歌大型语言模型概念指南.pdf

MindSpeed-LLM作为昇腾大模型训练框架，旨在为华为 昇腾芯片 提供端到端的大语言模型训练方案,

tensorrt-llm-0.7.0-cp310-cp310-win-amd64.whl

tensorrt-llm-0.7.1-cp310-cp310-win-amd64.whl

tensorrt-llm-0.5.0-0-cp310-cp310-win-amd64.whl

pandas_llm-0.0.4-py3-none-any.whl

pandas_llm-0.0.6-py3-none-any.whl

pandas_llm-0.0.3-py3-none-any.whl

pandas_llm-0.0.5-py3-none-any.whl

pandas_llm-0.0.2-py3-none-any.whl

大模型微调入门 LLM-quickstart-main

tensorrt-llm-0.6.1-cp310-cp310-win-amd64.whl

基于LLM的知识图谱补全研究

最新资源

MindSpeed-LLM作为昇腾大模型训练框架，旨在为华为昇腾芯片提供端到端的大语言模型训练方案,