You Only Cache Once: Decoder-Decoder Architectures for Language Models

Yutao Sun∗†‡   Li Dong∗†   Yi Zhu†   Shaohan Huang†   Wenhui Wang†   Shuming Ma†   Quanlu Zhang†   Jianyong Wang‡   Furu Wei†⋄

† Microsoft Research   ‡ Tsinghua University

https://aka.ms/GeneralAI
Abstract
We introduce a decoder-decoder architecture, YOCO, for large language models, which only caches key-value pairs once. It consists of two components, i.e., a cross-decoder stacked upon a self-decoder. The self-decoder efficiently encodes global key-value (KV) caches that are reused by the cross-decoder via cross-attention. The overall model behaves like a decoder-only Transformer, although YOCO only caches once. The design substantially reduces GPU memory demands, yet retains global attention capability. Additionally, the computation flow enables prefilling to early exit without changing the final output, thereby significantly speeding up the prefill stage. Experimental results demonstrate that YOCO achieves favorable performance compared to Transformer in various settings of scaling up model size and number of training tokens. We also extend YOCO to 1M context length with near-perfect needle retrieval accuracy. The profiling results show that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes. Code is available at https://aka.ms/YOCO.
[Figure 1 panels, inference cost at a 512K context length, Transformer vs. YOCO: GPU memory ↓ (GB) reduced 6.4×, throughput ↑ (wps) improved 9.6×, and prefilling latency ↓ (s) reduced 30.3×. The accompanying diagram shows the decoder-decoder LLM: a self-decoder produces the KV cache, which the cross-decoder reuses.]
Figure 1: We propose a decoder-decoder architecture, YOCO, for large language models, which only caches key/value once. YOCO markedly reduces the KV cache memory and the prefilling time, while being scalable in terms of training tokens, model size, and context length. The inference cost is reported with 512K as the context length; Figures 7–10 present more results for different lengths.
∗ Equal contribution. ⋄ Corresponding author.
arXiv:2405.05254v2 [cs.CL] 9 May 2024
1 Introduction
The decoder-only Transformer [VSP+17] has become the de facto architecture for language models. Numerous efforts have continued to develop suitable architectures for language modeling, along three main strands of exploration. First, encoder-only language models, such as BERT [DCLT19], bidirectionally encode the input sequence. Second, encoder-decoder models, such as T5 [RSR+20], use a bidirectional encoder to encode the input and a unidirectional decoder to generate the output. Both of the above layouts struggle with autoregressive generation due to bidirectionality. Specifically, encoders have to re-encode the whole input and all previously generated tokens for each generation step. Although an encoder-decoder model can use only its decoder to generate, the output tokens then do not fully leverage the parameters of the encoder, especially in multi-turn conversation. Third, decoder-only language models, such as GPT [BMR+20], generate tokens autoregressively. By caching the previously computed key/value vectors, the model can reuse them for the current generation step. The key-value (KV) cache avoids encoding the history again for each token, greatly improving inference speed. This compelling feature establishes the decoder-only language model as the standard option.
However, as the number of serving tokens increases, the KV caches occupy a large amount of GPU memory, rendering the inference of large language models memory-bounded [PDC+22]. For example, for a 65B-size language model (augmented with grouped-query attention [ALTdJ+23] and 8-bit KV quantization), 512K tokens occupy about 86GB of GPU memory, which is even larger than the capacity of one H100-80GB GPU. In addition, the prefilling latency of long-sequence input is extremely high. For instance, using four H100 GPUs, a 7B language model (augmented with Flash-Decoding [DHMS23] and kernel fusion) requires about 110 seconds to prefill 450K tokens, and 380 seconds for a 1M length. These bottlenecks make it difficult to deploy long-context language models in practice.
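As a sanity check on the 86GB figure, the back-of-the-envelope estimate below assumes one plausible 65B-class configuration (80 layers, 8 grouped-query KV heads of dimension 128, 8-bit keys and values); these exact hyperparameters are our assumption rather than numbers stated in the text.

```python
# Hedged estimate of a Transformer's KV-cache size at a 512K context.
# Assumed 65B-class config (not specified in the paper): 80 layers,
# 8 KV heads under grouped-query attention, head dim 128, 8-bit KV cache.
layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 1
tokens = 512 * 1024  # 512K-token context

# Each token stores one key and one value vector per layer and KV head.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gb = tokens * bytes_per_token / 1e9

print(f"{bytes_per_token // 1024} KiB per token, {total_gb:.1f} GB in total")
# -> 160 KiB per token, 85.9 GB in total (consistent with the ~86GB quoted above)
```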
In this work, we propose a decoder-decoder architecture, YOCO, for large language models, which only caches KV pairs once. Specifically, we stack a cross-decoder upon a self-decoder. Given an input sequence, the self-decoder utilizes efficient self-attention to obtain KV caches. Then the cross-decoder layers employ cross-attention to reuse the shared KV caches. The decoder-decoder architecture is conceptually similar to encoder-decoder, but the whole model behaves more like a decoder-only model from the external view. So, it naturally fits into autoregressive generation tasks, such as language modeling. First, because YOCO only caches once², the GPU memory consumption of KV caches is significantly reduced. Second, the computation flow of the decoder-decoder architecture enables prefilling to early exit before entering the cross-decoder. This property speeds up the prefill stage dramatically, improving the user experience for long-context language models. Third, YOCO allows for more efficient system design for distributed long-sequence training. In addition, we propose gated retention for the self-decoder, which augments retention [SDH+23] with a data-controlled gating mechanism.
We conduct extensive experiments to show that YOCO achieves favorable language modeling performance and has many advantages in terms of inference efficiency. Experimental results demonstrate that YOCO can be scaled up with more training tokens, larger model size, and longer context length. Specifically, we scale up the 3B YOCO model to trillions of training tokens, attaining results on par with prominent Transformer language models, such as StableLM [TBMR]. Moreover, the scaling curves ranging from 160M to 13B show that YOCO is competitive compared to Transformer. We also extend the context length of YOCO to 1M tokens, achieving near-perfect needle retrieval accuracy. In the multi-needle test, YOCO obtains competitive results even compared to larger Transformers.

In addition to good performance on various tasks, the profiling results show that YOCO improves the GPU memory footprint, prefill latency, throughput, and serving capacity. In particular, the memory of KV caches can be reduced by about 80× for 65B models. Even for a 3B model, the overall inference memory consumption can be reduced by two times for 32K tokens and by more than nine times for 1M tokens. The prefill stage is sped up by 71.8× for the 1M context and 2.87× for the 32K input. For example, for a 512K context, YOCO reduces the Transformer prefilling latency from 180 seconds to less than six seconds. The results position YOCO as a strong candidate model architecture for future large language models with native long-sequence support.
²The word "once" refers to the global KV cache. Strictly, the self-decoder also needs to store a certain number of caches. As the self-decoder utilizes an efficient attention module, the cache size is bounded by a constant, which can be ignored compared to the global caches when the sequence length is large.
Figure 2: Overview of the decoder-decoder architecture. Self-decoder generates the global KV cache.
Then cross-decoder employs cross-attention to reuse the shared KV caches. Both self-decoder and
cross-decoder use causal masking. The overall architecture behaves like a decoder-only Transformer,
autoregressively generating tokens.
2 You Only Cache Once (YOCO)
The proposed architecture, named YOCO, is designed for autoregressive modeling, such as large language models (LLMs). As shown in Figure 2, the decoder-decoder architecture has two parts, i.e., self-decoder and cross-decoder. Specifically, YOCO is stacked with $L$ blocks, where the first $\frac{L}{2}$ layers are the self-decoder while the remaining modules are the cross-decoder. Given an input sequence $x = x_1 \cdots x_{|x|}$, the input embeddings are packed into $X^0 = [x_1, \cdots, x_{|x|}] \in \mathbb{R}^{|x| \times d_{\text{model}}}$, where $d_{\text{model}}$ is the hidden dimension. We first obtain contextualized vector representations $X^l = \text{Self-Decoder}(X^{l-1}),\ l \in [1, \frac{L}{2}]$, where $X^{L/2}$ is used to produce the KV caches $\hat{K}, \hat{V}$ for the cross-decoder. Then we compute $X^l = \text{Cross-Decoder}(X^{l-1}, \hat{K}, \hat{V}),\ l \in [\frac{L}{2}+1, L]$ to get the output vectors $X^L$.
Both the self-decoder and the cross-decoder follow a similar block layout (i.e., interleaved attention and feed-forward network) as in Transformer [VSP+17]. We also include pre-RMSNorm [ZS19], SwiGLU [Sha20], and grouped-query attention [ALTdJ+23] as improvements. The difference between the two parts lies in the attention modules. The self-decoder (Section 2.1) uses efficient self-attention (e.g., sliding-window attention). In comparison, the cross-decoder (Section 2.2) uses global cross-attention to attend to the shared KV caches produced by the output of the self-decoder.
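To make the layer split concrete, the control-flow sketch below mirrors the description above: the first L/2 layers run as the self-decoder, a single KV projection produces the shared cache, and the remaining L/2 layers run as the cross-decoder. The function and argument names are ours, the layer internals are elided (see the block sketches in Sections 2.1 and 2.2), and this is our reading of the paper rather than the authors' implementation.

```python
import torch

def yoco_forward(x_emb: torch.Tensor, self_layers, cross_layers, make_kv):
    """Forward pass of the decoder-decoder layout (Figure 2).

    self_layers / cross_layers are lists of layer callables (L/2 each);
    make_kv maps the self-decoder output X^{L/2} to the shared (K_hat, V_hat).
    """
    h = x_emb                              # X^0: (batch, |x|, d_model)
    for layer in self_layers:              # X^l = Self-Decoder(X^{l-1}), l <= L/2
        h = layer(h)
    k_hat, v_hat = make_kv(h)              # global KV cache, computed once (Eq. 2)
    for layer in cross_layers:             # X^l = Cross-Decoder(X^{l-1}, K_hat, V_hat)
        h = layer(h, k_hat, v_hat)
    return h                               # X^L, fed to the softmax classifier
```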
2.1 Self-Decoder
The self-decoder takes the token embeddings $X^0$ as input and computes the intermediate vector representation $M = X^{L/2}$:

$$
Y^l = \text{ESA}(\text{LN}(X^l)) + X^l, \qquad
X^{l+1} = \text{SwiGLU}(\text{LN}(Y^l)) + Y^l
\tag{1}
$$

where $\text{ESA}(\cdot)$ represents efficient self-attention, $\text{SwiGLU}(X) = (\text{swish}(X W_G) \odot X W_1) W_2$, and RMSNorm [ZS19] is used for $\text{LN}(\cdot)$. Causal masking is used for efficient self-attention.
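As one concrete, unofficial instantiation of Eq. (1), the block below uses single-head sliding-window attention as the efficient self-attention ESA(·); the paper also proposes gated retention (Section 3). The class name, parameter names, and the single-head simplification are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDecoderLayer(nn.Module):
    """One self-decoder block, Eq. (1): Y = ESA(LN(X)) + X, then a SwiGLU FFN.
    ESA is sketched here as single-head sliding-window attention."""
    def __init__(self, d_model: int, d_ffn: int, window: int):
        super().__init__()
        self.window = window
        self.attn_norm, self.ffn_norm = nn.RMSNorm(d_model), nn.RMSNorm(d_model)
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)
        self.w_gate = nn.Linear(d_model, d_ffn, bias=False)  # W_G
        self.w1 = nn.Linear(d_model, d_ffn, bias=False)      # W_1
        self.w2 = nn.Linear(d_ffn, d_model, bias=False)      # W_2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.size(1)
        h = self.attn_norm(x)
        q, k, v = self.wq(h), self.wk(h), self.wv(h)
        # Banded causal mask: position i attends to the last `window` positions <= i.
        idx = torch.arange(n, device=x.device)
        mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < self.window)
        y = x + self.wo(F.scaled_dot_product_attention(q, k, v, attn_mask=mask))  # Y^l
        g = self.ffn_norm(y)
        return y + self.w2(F.silu(self.w_gate(g)) * self.w1(g))  # X^{l+1}, SwiGLU FFN
```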
[Figure 3 diagram: during prefilling, the context passes through the self-decoder to build the KV cache while the cross-decoder is skipped; during generation, new tokens pass through both the self-decoder and the cross-decoder.]

Figure 3: YOCO inference. Prefill: encode input tokens in parallel. Generation: decode output tokens one by one. The computation flow enables prefilling to early exit without changing the final output, thereby significantly speeding up the prefill stage.
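The early exit in Figure 3 can be sketched as follows: the self-decoder must process every prompt position to build K̂ and V̂, but the cross-decoder output is needed only for the last prompt token that seeds generation, since cross-attention reads only the shared cache and not other cross-decoder states. The function and argument names below are ours, and this is a minimal illustration rather than the reference implementation.

```python
def prefill(prompt_emb, self_layers, cross_layers, make_kv):
    """Prefill with early exit: run the self-decoder over the full prompt,
    then run the cross-decoder only on the final position."""
    h = prompt_emb                          # (batch, n_prompt, d_model)
    for layer in self_layers:
        h = layer(h)                        # full-length self-decoder pass
    k_hat, v_hat = make_kv(h)               # shared KV cache, Eq. (2)

    last = h[:, -1:, :]                     # early exit: keep only the last position
    for layer in cross_layers:
        last = layer(last, k_hat, v_hat)    # cross-decoder on a single position
    return last, (k_hat, v_hat)             # seeds generation; cache reused for decoding
```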
|             | KV Cache Memory |
|-------------|-----------------|
| Transformer | O(LND)          |
| YOCO        | O((N + L)D)     |

Table 1: Inference memory complexity of KV caches. N, L, D are the sequence length, number of layers, and hidden dimension.

|             | Prefilling Time |
|-------------|-----------------|
| Transformer | O(LN²D)         |
| YOCO        | O(LND)          |

Table 2: Prefilling time complexity of attention modules. N, L, D are the same as above.
The key property of the efficient self-attention module is O(1) inference memory, i.e., a constant number of KV caches. For example, the cache size of sliding-window attention [CGRS19] depends on the window size instead of the input length. More design choices (e.g., gated retention) of the efficient self-attention module are detailed in Section 3.
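To illustrate the constant-memory property, the toy cache below (our own example, specific to the sliding-window case) never grows beyond the window size no matter how many tokens have been processed; gated retention achieves O(1) memory by a different mechanism.

```python
import torch

class SlidingWindowKVCache:
    """Per-layer KV buffer bounded by the window size, not the sequence length."""
    def __init__(self, window: int):
        self.window = window
        self.keys: list[torch.Tensor] = []
        self.values: list[torch.Tensor] = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.window:   # evict entries outside the attention window
            self.keys.pop(0)
            self.values.pop(0)

    def get(self) -> tuple[torch.Tensor, torch.Tensor]:
        # At most `window` cached positions, regardless of how long decoding runs.
        return torch.stack(self.keys), torch.stack(self.values)
```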
2.2 Cross-Decoder
First, the output of the self-decoder $X^{L/2}$ generates the global KV caches $\hat{K}, \hat{V}$ for the cross-decoder:

$$
\hat{K} = \text{LN}(X^{L/2}) W_K, \qquad
\hat{V} = \text{LN}(X^{L/2}) W_V
\tag{2}
$$

where $W_K, W_V \in \mathbb{R}^{d \times d}$ are learnable weights. Then, cross-decoder layers are stacked after the self-decoder to obtain the final output vectors $X^L$. The KV caches $\hat{K}, \hat{V}$ are reused by all the $\frac{L}{2}$ cross-decoder modules:

$$
\hat{Q}^l = \text{LN}(X^l) W_Q^l, \qquad
Y^l = \text{Attention}(\hat{Q}^l, \hat{K}, \hat{V}) + X^l, \qquad
X^{l+1} = \text{SwiGLU}(\text{LN}(Y^l)) + Y^l
\tag{3}
$$

where $\text{Attention}(\cdot)$ is standard multi-head attention [VSP+17], and $W_Q^l \in \mathbb{R}^{d \times d}$ is a learnable matrix. Causal masking is also used for cross-attention. Because cross-attention is compatible with grouped-query attention [ALTdJ+23], we can further save the memory consumption of KV caches. After obtaining $X^L$, a softmax classifier performs next-token prediction.
2.3 Inference Advantages
In addition to competitive language modeling results, YOCO significantly reduces serving costs and
improves inference performance. We report detailed inference comparisons in Section 4.4.
Saving GPU Memory and Serving More Tokens. Table 1 compares the memory complexity between Transformers and YOCO. Specifically, because global KV caches are reused and efficient self-attention needs constant caches, the number of caches is O(N + CL), where N is the input length, C is a constant (e.g., the sliding window size), and L is the number of layers. For long sequences, CL is much smaller than N, so about O(N) caches are required, i.e., you only cache once.
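To see how large the gap between O(N·L) and O(N + C·L) becomes at long context, the arithmetic below plugs in illustrative values of our own choosing (L = 64 layers, window C = 1024, N = 512K tokens); the counts are cached positions, ignoring heads and head dimension, which scale both sides equally.

```python
# Illustrative cache-entry counts; L, C, N are our own example values.
N, L, C = 512 * 1024, 64, 1024

transformer_entries = N * L      # every layer caches every position
yoco_entries = N + C * L         # one global cache plus a constant per self-decoder layer

print(f"Transformer: {transformer_entries:,} cached positions")  # 33,554,432
print(f"YOCO:        {yoco_entries:,} cached positions")         # 589,824
print(f"ratio: ~{transformer_entries / yoco_entries:.0f}x")      # ~57x fewer entries
```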
In comparison, Transformer decoders have to store N × L keys and values during inference. So YOCO roughly saves L times the GPU memory for caches compared to Transformer decoders. Because the inference capacity bottleneck becomes KV caches (Figure 7b), our method enables us to serve
(The remaining 19 pages of the document are not included in this preview.)