LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

Edward Hu∗, Yelong Shen∗, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen

Microsoft Corporation

{edwardhu, yeshe, phwallis, zeyuana, yuanzhil, swang, luw, wzchen}@microsoft.com

yuanzhil@andrew.cmu.edu

(Version 2)

ABSTRACT

An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example, deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on par with or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.

1 INTRODUCTION

Figure 1: Our reparametrization. We only train A and B.

Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model. The major downside of fine-tuning is that the new model contains as many parameters as the original model. As larger models are trained every few months, this changes from a mere “inconvenience” for GPT-2 (Radford et al., b) or RoBERTa large (Liu et al., 2019) to a critical deployment challenge for GPT-3 (Brown et al., 2020) with 175 billion trainable parameters.

Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks. This way, we only need to store and load a small number of task-specific parameters in addition to the pre-trained model for each task, greatly boosting the operational efficiency when deployed. However, existing techniques often introduce inference latency (Houlsby et al., 2019; Rebuffi et al., 2017) by extending model depth or reduce the model’s usable sequence length (Li & Liang, 2021; Lester et al., 2021; Hambardzumyan et al., 2020; Liu et al., 2021) (Section 3). More importantly, these methods often fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.

∗ Equal contribution.

Compared to V1, this draft includes better baselines, experiments on GLUE, and more on adapter latency.

While GPT-3 175B achieves non-trivial performance with few-shot learning, fine-tuning boosts its performance significantly as shown in Appendix A.

arXiv:2106.09685v2 [cs.CL] 16 Oct 2021

We take inspiration from Li et al. (2018a) and Aghajanyan et al. (2020), which show that learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3 175B as an example, we show that a very low rank (i.e., r in Figure 1 can be one or two) suffices even when the full rank (i.e., d) is as high as 12,288, making LoRA both storage- and compute-efficient.

LoRA possesses several key advantages.

• A pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can freeze the shared model and efficiently switch tasks by replacing the matrices A and B in Figure 1, reducing the storage requirement and task-switching overhead significantly.

• LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3 times when using adaptive optimizers since we do not need to calculate the gradients or maintain the optimizer states for most parameters. Instead, we only optimize the injected, much smaller low-rank matrices.

• Our simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction.

• LoRA is orthogonal to many prior methods and can be combined with many of them, such as prefix-tuning. We provide an example in Appendix E.

Terminologies and Conventions We make frequent references to the Transformer architecture and use the conventional terminologies for its dimensions. We call the input and output dimension size of a Transformer layer $d_{\mathrm{model}}$. We use $W_q$, $W_k$, $W_v$, and $W_o$ to refer to the query/key/value/output projection matrices in the self-attention module. $W$ or $W_0$ refers to a pre-trained weight matrix and $\Delta W$ its accumulated gradient update during adaptation. We use $r$ to denote the rank of a LoRA module. We follow the conventions set out by (Vaswani et al., 2017; Brown et al., 2020) and use Adam (Loshchilov & Hutter, 2019; Kingma & Ba, 2017) for model optimization and use a Transformer MLP feedforward dimension $d_{\mathrm{ffn}} = 4 \times d_{\mathrm{model}}$.

2 PROBLEM STATEMENT

While our proposal is agnostic to training objective, we focus on language modeling as our motivating use case. Below is a brief description of the language modeling problem and, in particular, the maximization of conditional probabilities given a task-specific prompt.

Suppose we are given a pre-trained autoregressive language model $P_{\Phi}(y|x)$ parametrized by $\Phi$. For instance, $P_{\Phi}(y|x)$ can be a generic multi-task learner such as GPT (Radford et al., b; Brown et al., 2020) based on the Transformer architecture (Vaswani et al., 2017). Consider adapting this pre-trained model to downstream conditional text generation tasks, such as summarization, machine reading comprehension (MRC), and natural language to SQL (NL2SQL). Each downstream task is represented by a training dataset of context-target pairs: $Z = \{(x_i, y_i)\}_{i=1,..,N}$, where both $x_i$ and $y_i$ are sequences of tokens. For example, in NL2SQL, $x_i$ is a natural language query and $y_i$ its corresponding SQL command; for summarization, $x_i$ is the content of an article and $y_i$ its summary.

During full fine-tuning, the model is initialized to pre-trained weights $\Phi_0$ and updated to $\Phi_0 + \Delta\Phi$ by repeatedly following the gradient to maximize the conditional language modeling objective:

$$\max_{\Phi} \sum_{(x,y)\in Z} \sum_{t=1}^{|y|} \log\left(P_{\Phi}(y_t \mid x, y_{<t})\right) \qquad (1)$$
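The objective in Equation (1) can be sketched in a few lines. This is only an illustration: the function names, the dictionary-based representation of the per-step distributions, and the toy probabilities are all hypothetical and stand in for a real language model's outputs.

```python
import math

def sequence_log_likelihood(step_probs, target):
    """Sum of log P(y_t | x, y_<t) over the tokens of one target sequence.

    step_probs[t] maps each candidate token to the probability the model
    assigns at step t, already conditioned on x and the previous targets.
    """
    return sum(math.log(step_probs[t][tok]) for t, tok in enumerate(target))

def dataset_objective(examples):
    """Objective (1): add up the sequence log-likelihoods over all pairs in Z."""
    return sum(sequence_log_likelihood(probs, y) for probs, y in examples)

# Hypothetical toy dataset: made-up per-step distributions and targets.
toy_Z = [
    ([{"SELECT": 0.5, "DROP": 0.5}, {"*": 0.25, "id": 0.75}], ["SELECT", "*"]),
    ([{"a": 1.0}], ["a"]),
]

objective = dataset_objective(toy_Z)
print(objective)  # log(0.5) + log(0.25) + log(1.0)
```

Fine-tuning would adjust the parameters that produce `step_probs` so that this sum increases; the sketch only evaluates the sum for fixed probabilities.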

One of the main drawbacks for full fine-tuning is that for each downstream task, we learn a different set of parameters $\Delta\Phi$ whose dimension $|\Delta\Phi|$ equals $|\Phi_0|$. Thus, if the pre-trained model is large (such as GPT-3 with $|\Phi_0| \approx 175$ Billion), storing and deploying many independent instances of fine-tuned models can be challenging, if at all feasible.

In this paper, we adopt a more parameter-efficient approach, where the task-specific parameter increment $\Delta\Phi = \Delta\Phi(\Theta)$ is further encoded by a much smaller-sized set of parameters $\Theta$ with $|\Theta| \ll |\Phi_0|$. The task of finding $\Delta\Phi$ thus becomes optimizing over $\Theta$:

$$\max_{\Theta} \sum_{(x,y)\in Z} \sum_{t=1}^{|y|} \log\left(P_{\Phi_0+\Delta\Phi(\Theta)}(y_t \mid x, y_{<t})\right) \qquad (2)$$

In the subsequent sections, we propose to use a low-rank representation to encode $\Delta\Phi$ that is both compute- and memory-efficient. When the pre-trained model is GPT-3 175B, the number of trainable parameters $|\Theta|$ can be as small as 0.01% of $|\Phi_0|$.

3 AREN’T EXISTING SOLUTIONS GOOD ENOUGH?

The problem we set out to tackle is by no means new. Since the inception of transfer learning, dozens of works have sought to make model adaptation more parameter- and compute-efficient. See Section 6 for a survey of some of the well-known works. Using language modeling as an example, there are two prominent strategies when it comes to efficient adaptations: adding adapter layers (Houlsby et al., 2019; Rebuffi et al., 2017; Pfeiffer et al., 2021; Rücklé et al., 2020) or optimizing some forms of the input layer activations (Li & Liang, 2021; Lester et al., 2021; Hambardzumyan et al., 2020; Liu et al., 2021). However, both strategies have their limitations, especially in a large-scale and latency-sensitive production scenario.

Adapter Layers Introduce Inference Latency There are many variants of adapters. We focus on the original design by Houlsby et al. (2019) which has two adapter layers per Transformer block and a more recent one by Lin et al. (2020) which has only one per block but with an additional LayerNorm (Ba et al., 2016). While one can reduce the overall latency by pruning layers or exploiting multi-task settings (Rücklé et al., 2020; Pfeiffer et al., 2021), there are no direct ways to bypass the extra compute in adapter layers. This seems like a non-issue since adapter layers are designed to have few parameters (sometimes <1% of the original model) by having a small bottleneck dimension, which limits the FLOPs they can add. However, large neural networks rely on hardware parallelism to keep the latency low, and adapter layers have to be processed sequentially. This makes a difference in the online inference setting where the batch size is typically as small as one. In a generic scenario without model parallelism, such as running inference on GPT-2 (Radford et al., b) medium on a single GPU, we see a noticeable increase in latency when using adapters, even with a very small bottleneck dimension (Table 1).

This problem gets worse when we need to shard the model as done in Shoeybi et al. (2020); Lepikhin et al. (2020), because the additional depth requires more synchronous GPU operations such as AllReduce and Broadcast, unless we store the adapter parameters redundantly many times.

Directly Optimizing the Prompt is Hard The other direction, as exemplified by prefix tuning (Li & Liang, 2021), faces a different challenge. We observe that prefix tuning is difficult to optimize and that its performance changes non-monotonically in trainable parameters, confirming similar observations in the original paper. More fundamentally, reserving a part of the sequence length for adaptation necessarily reduces the sequence length available to process a downstream task, which we suspect makes tuning the prompt less performant compared to other methods. We defer the study on task performance to Section 5.

Batch Size         32                     16                   1
Sequence Length    512                    256                  128
|Θ|                0.5M                   11M                  11M
Fine-Tune/LoRA     1449.4±0.8             338.0±0.6            19.8±2.7
Adapter^L          1482.0±1.0 (+2.2%)     354.8±0.5 (+5.0%)    23.9±2.1 (+20.7%)
Adapter^H          1492.2±1.0 (+3.0%)     366.3±0.5 (+8.4%)    25.8±2.2 (+30.3%)

Table 1: Inference latency of a single forward pass in GPT-2 medium measured in milliseconds, averaged over 100 trials. We use an NVIDIA Quadro RTX8000. “|Θ|” denotes the number of trainable parameters in adapter layers. Adapter^L and Adapter^H are two variants of adapter tuning, which we describe in Section 5.1. The inference latency introduced by adapter layers can be significant in an online, short-sequence-length scenario. See the full study in Appendix B.
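The percentages in Table 1 can be reproduced from the mean latencies alone. The sketch below assumes they are relative increases over the Fine-Tune/LoRA row; the dictionary keys and the labels `adapter_first_row` and `adapter_second_row` (the two adapter rows in table order) are hypothetical names chosen for this illustration.

```python
# Mean latencies in milliseconds copied from Table 1 (standard deviations omitted).
baseline = {"bs32_seq512": 1449.4, "bs16_seq256": 338.0, "bs1_seq128": 19.8}
adapters = {
    "adapter_first_row": {"bs32_seq512": 1482.0, "bs16_seq256": 354.8, "bs1_seq128": 23.9},
    "adapter_second_row": {"bs32_seq512": 1492.2, "bs16_seq256": 366.3, "bs1_seq128": 25.8},
}

def overhead_percent(latency_ms, reference_ms):
    """Relative latency increase over the reference, in percent."""
    return 100.0 * (latency_ms - reference_ms) / reference_ms

overheads = {
    name: {cfg: round(overhead_percent(ms, baseline[cfg]), 1) for cfg, ms in row.items()}
    for name, row in adapters.items()
}

for name, row in overheads.items():
    print(name, row)
```

The relative overhead grows as the batch size and sequence length shrink, which is the online, short-sequence case the caption highlights.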

4 OUR METHOD

We describe the simple design of LoRA and its practical benefits. The principles outlined here apply to any dense layers in deep learning models, though we only focus on certain weights in Transformer language models in our experiments as the motivating use case.

4.1 LOW-RANK-PARAMETRIZED UPDATE MATRICES

A neural network contains many dense layers which perform matrix multiplication. The weight matrices in these layers typically have full-rank. When adapting to a specific task, Aghajanyan et al. (2020) show that the pre-trained language models have a low “intrinsic dimension” and can still learn efficiently despite a random projection to a smaller subspace. Inspired by this, we hypothesize the updates to the weights also have a low “intrinsic rank” during adaptation. For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, we constrain its update by representing the latter with a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. During training, $W_0$ is frozen and does not receive gradient updates, while $A$ and $B$ contain trainable parameters. Note both $W_0$ and $\Delta W = BA$ are multiplied with the same input, and their respective output vectors are summed coordinate-wise. For $h = W_0 x$, our modified forward pass yields:

$$h = W_0 x + \Delta W x = W_0 x + BAx \qquad (3)$$

We illustrate our reparametrization in Figure 1. We use a random Gaussian initialization for $A$ and zero for $B$, so $\Delta W = BA$ is zero at the beginning of training. We then scale $\Delta W x$ by $\alpha / r$, where $\alpha$ is a constant in $r$. When optimizing with Adam, tuning $\alpha$ is roughly the same as tuning the learning rate if we scale the initialization appropriately. As a result, we simply set $\alpha$ to the first $r$ we try and do not tune it. This scaling helps to reduce the need to retune hyperparameters when we vary $r$ (Yang & Hu, 2021).
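The forward pass in Equation (3), together with the initialization and the α/r scaling, can be sketched with NumPy as follows. This is a minimal illustration rather than the released package: the class name `LoRALinear`, the toy dimensions, and the Gaussian standard deviation of 0.01 are assumptions made for the example, and no training loop is shown.

```python
import numpy as np

class LoRALinear:
    """Dense layer computing h = W0 x + (alpha / r) * B A x with W0 frozen."""

    def __init__(self, W0, r, alpha, rng):
        d, k = W0.shape
        self.W0 = W0                                  # frozen pre-trained weight, shape (d, k)
        self.A = rng.normal(0.0, 0.01, size=(r, k))   # random Gaussian init (std is an assumption)
        self.B = np.zeros((d, r))                     # zero init, so BA = 0 before training
        self.scale = alpha / r

    def delta_W(self):
        """The low-rank update (alpha / r) * BA as a full d x k matrix."""
        return self.scale * (self.B @ self.A)

    def forward(self, x):
        # Compute B(Ax) instead of (BA)x so the d x k update is never materialized.
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4                      # toy sizes, far smaller than a real model
W0 = rng.normal(size=(d, k))
layer = LoRALinear(W0, r=r, alpha=4, rng=rng)   # alpha set to the first r tried
x = rng.normal(size=k)

h_init = layer.forward(x)                # equals W0 @ x because B starts at zero
trainable = layer.A.size + layer.B.size  # r * (d + k) trainable values
print(trainable, W0.size)
```

With these toy sizes the trainable factors hold r(d + k) = 448 values against d·k = 3072 in the frozen matrix, and the gap widens as d and k grow while r stays small.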

A Generalization of Full Fine-tuning. A more general form of fine-tuning allows the training of a subset of the pre-trained parameters. LoRA takes a step further and does not require the accumulated gradient update to weight matrices to have full-rank during adaptation. This means that when applying LoRA to all weight matrices and training all biases (which represent a negligible number of parameters compared to weights), we roughly recover the expressiveness of full fine-tuning by setting the LoRA rank $r$ to the rank of the pre-trained weight matrices. In other words, as we increase the number of trainable parameters (an inevitability when adapting to hard tasks), training LoRA roughly converges to training the original model, while adapter-based methods converge to an MLP and prefix-based methods to a model that cannot take long input sequences.

No Additional Inference Latency. When deployed in production, we can explicitly compute and store $W = W_0 + BA$ and perform inference as usual. Note that both $W_0$ and $BA$ are in $\mathbb{R}^{d \times k}$. When we need to switch to another downstream task, we can recover $W_0$ by subtracting $BA$ and then adding a different $B'A'$, a quick operation with very little memory overhead. Critically, this guarantees that we do not introduce any additional latency during inference compared to a fine-tuned model by construction.
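Merging and task switching can be illustrated with a short NumPy sketch. The factors below are random stand-ins for trained matrices, the toy dimensions are arbitrary, and the helper `random_adapter` is hypothetical; any α/r scaling is assumed to be already folded into B.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 32, 24, 2
W0 = rng.normal(size=(d, k))             # frozen pre-trained weight

def random_adapter():
    """Random stand-ins for trained factors B (d x r) and A (r x k)."""
    return rng.normal(size=(d, r)), rng.normal(size=(r, k))

B1, A1 = random_adapter()                # hypothetical task 1
B2, A2 = random_adapter()                # hypothetical task 2
x = rng.normal(size=k)

# Merge for deployment: one d x k matrix with the same shape as W0.
W = W0 + B1 @ A1
merged_out = W @ x
unmerged_out = W0 @ x + B1 @ (A1 @ x)    # the two-branch form used during training

# Switch tasks in place: subtract the old update, then add the new one.
W = W - B1 @ A1 + B2 @ A2
switched_out = W @ x

print(np.allclose(merged_out, unmerged_out))
```

Since the merged matrix has the same shape as the original weight, inference runs one matrix product per layer, exactly as it would for a fully fine-tuned model. Recovering the original weight by subtraction is exact only up to floating-point rounding.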

4.2 APPLYING LORA TO TRANSFORMER

In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable parameters. In the Transformer architecture, there are four weight matrices in the self-attention module ($W_q$, $W_k$, $W_v$, $W_o$) and two in the MLP module. We treat $W_q$ (or $W_k$, $W_v$) as a single matrix of dimension $d_{\mathrm{model}} \times d_{\mathrm{model}}$, even though the output dimension is usually sliced into attention heads. We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules (so they are not trained in downstream tasks) both for simplicity and parameter-efficiency. We further study the effect of adapting different types of attention weight matrices in a Transformer in Section 7.1. We leave the empirical investigation of adapting the MLP layers, LayerNorm layers, and biases to future work.

Practical Benefits and Limitations. The most significant benefit comes from the reduction in memory and storage usage. For a large Transformer trained with Adam, we reduce that VRAM usage by up to 2/3 if $r \ll d_{\mathrm{model}}$ as we do not need to store the optimizer states for the frozen parameters. On GPT-3 175B, we reduce the VRAM consumption during training from 1.2TB to 350GB. With $r = 4$ and only the query and value projection matrices being adapted, the checkpoint size is reduced by roughly 10,000× (from 350GB to 35MB). This allows us to train with significantly fewer GPUs and avoid I/O bottlenecks. Another benefit is that we can switch between tasks while deployed at a much lower cost by only swapping the LoRA weights as opposed to all the parameters. This allows for the creation of many customized models that can be swapped in and out on the fly on machines that store the pre-trained weights in VRAM. We also observe a 25% speedup during training on GPT-3 175B compared to full fine-tuning as we do not need to calculate the gradient for the vast majority of the parameters.

LoRA also has its limitations. For example, it is not straightforward to batch inputs to different tasks with different A and B in a single forward pass, if one chooses to absorb A and B into W to eliminate additional inference latency. It is possible, however, to leave the weights unmerged and dynamically choose the LoRA modules to use for samples in a batch in scenarios where latency is not critical.
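The checkpoint-size figures can be checked with a back-of-the-envelope calculation. The sketch below takes the hidden size of 12,288, r = 4, and the query/value choice from the text. It also assumes 96 Transformer layers (the GPT-3 175B depth reported by Brown et al., 2020) and 2 bytes per stored parameter; neither value is stated in this section.

```python
# Values stated in the text.
d_model = 12_288          # GPT-3 175B hidden size
r = 4                     # LoRA rank
adapted_per_layer = 2     # query and value projection matrices
total_params = 175e9      # size of the pre-trained model

# Assumptions not stated in this section.
n_layers = 96             # GPT-3 175B depth, from Brown et al. (2020)
bytes_per_param = 2       # 16-bit storage

# Each adapted d_model x d_model matrix gets B (d_model x r) and A (r x d_model).
lora_params = n_layers * adapted_per_layer * 2 * d_model * r
lora_megabytes = lora_params * bytes_per_param / 1e6
full_gigabytes = total_params * bytes_per_param / 1e9
fraction_percent = 100 * lora_params / total_params

print(lora_params, round(lora_megabytes, 1), full_gigabytes, round(fraction_percent, 4))
```

Under these assumptions the adapter holds about 18.9M parameters, roughly 38MB on disk against 350GB for the full model. That is close to the quoted 35MB and roughly 10,000× reduction, and it is about 0.01% of the model's parameters.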

5 EMPIRICAL EXPERIMENTS

We evaluate the downstream task performance of LoRA on RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021), and GPT-2 (Radford et al., b), before scaling up to GPT-3 175B (Brown et al., 2020). Our experiments cover a wide range of tasks, from natural language understanding (NLU) to generation (NLG). Specifically, we evaluate on the GLUE (Wang et al., 2019) benchmark for RoBERTa and DeBERTa. We follow the setup of Li & Liang (2021) on GPT-2 for a direct comparison and add WikiSQL (Zhong et al., 2017) (NL to SQL queries) and SAMSum (Gliwa et al., 2019) (conversation summarization) for large-scale experiments on GPT-3. See Appendix C for more details on the datasets we use. We use NVIDIA Tesla V100 for all experiments.

5.1 BASELINES

To compare with other baselines broadly, we replicate the setups used by prior work and reuse their reported numbers whenever possible. This, however, means that some baselines might only appear in certain experiments.

Fine-Tuning (FT) is a common approach for adaptation. During fine-tuning, the model is initialized to the pre-trained weights and biases, and all model parameters undergo gradient updates. A simple variant is to update only some layers while freezing others. We include one such baseline reported in prior work (Li & Liang, 2021) on GPT-2, which adapts just the last two layers ($\mathrm{FT}^{\mathrm{Top2}}$).

We still need the 350GB model during deployment; however, storing 100 adapted models only requires 350GB + 35MB * 100 ≈ 354GB as opposed to 100 * 350GB ≈ 35TB.

For GPT-3 175B, the training throughput for full fine-tuning is 32.5 tokens/s per V100 GPU; with the same number of weight shards for model parallelism, the throughput is 43.1 tokens/s per V100 GPU for LoRA.
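The arithmetic in these two notes can be worked through directly. The sketch uses only the numbers quoted above and decimal units (1GB = 1000MB). Relating the throughput figures to a "25% speedup" via time saved per token is one possible reading, not something the text states.

```python
# Storage for 100 adapted models, using the sizes quoted above.
base_gb = 350.0           # shared pre-trained model
adapter_mb = 35.0         # one LoRA checkpoint
n_tasks = 100

lora_total_gb = base_gb + n_tasks * adapter_mb / 1000   # one base model plus 100 adapters
full_total_tb = n_tasks * base_gb / 1000                # 100 fully fine-tuned copies

# Training throughput per V100 GPU, in tokens per second.
full_ft_tps = 32.5
lora_tps = 43.1

throughput_gain_percent = 100 * (lora_tps / full_ft_tps - 1)   # more tokens per second
time_saved_percent = 100 * (1 - full_ft_tps / lora_tps)        # less time per token

print(lora_total_gb, full_total_tb)
print(round(throughput_gain_percent, 1), round(time_saved_percent, 1))
```

The storage totals come to 353.5GB versus 35TB. The throughput figures correspond to about 33% more tokens per second, or equivalently about 25% less time per token.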
