A Study of How Data and Method Choices Affect Performance in Large Language Model Finetuning
Overview: This paper systematically studies how LLM model size, pretraining data size, finetuning data size, and parameter-efficient tuning (PET) methods affect performance when finetuning large language models (LLMs). The study shows that increasing LLM model size improves finetuning performance more than increasing pretraining data; the effectiveness of PET methods depends strongly on the task and the amount of finetuning data, and in some settings increasing the number of PET parameters actually hurts. The paper proposes a multiplicative joint scaling law to describe the relationship between these factors, providing a basis for understanding and selecting LLM finetuning methods.
Intended audience: natural language processing researchers, machine learning engineers, data scientists.
Use cases and goals: (1) understand how different factors affect LLM finetuning performance; (2) optimize the choice of LLM finetuning method, especially when data is limited; (3) explore the best way to apply large language models to different tasks.
Additional notes: Through comparative analysis across a variety of experimental conditions, the paper reveals several important regularities in the LLM finetuning process, which can guide model selection and optimization strategy in practical projects and serve as a useful reference for future research.
Published as a conference paper at ICLR 2024 (arXiv:2402.17193v1 [cs.CL], 27 Feb 2024)

WHEN SCALING MEETS LLM FINETUNING: THE EFFECT OF DATA, MODEL AND FINETUNING METHOD

Biao Zhang†   Zhongtao Liu⋄   Colin Cherry⋄   Orhan Firat†
†Google DeepMind   ⋄Google Research
{biaojiaxing,zhongtao,colincherry,orhanf}@google.com
ABSTRACT
While large language models (LLMs) often adopt finetuning to unlock their capabilities for downstream applications, our understanding of the inductive biases (especially the scaling properties) of different finetuning methods is still limited. To fill this gap, we conduct systematic experiments studying whether and how different scaling factors, including LLM model size, pretraining data size, new finetuning parameter size and finetuning data size, affect the finetuning performance. We consider two types of finetuning – full-model tuning (FMT) and parameter-efficient tuning (PET, including prompt tuning and LoRA), and explore their scaling behaviors in the data-limited regime where the LLM model size substantially outweighs the finetuning data size. Based on two sets of pretrained bilingual LLMs from 1B to 16B and experiments on bilingual machine translation and multilingual summarization benchmarks, we find that 1) LLM finetuning follows a power-based multiplicative joint scaling law between finetuning data size and each other scaling factor; 2) LLM finetuning benefits more from LLM model scaling than pretraining data scaling, and PET parameter scaling is generally ineffective; and 3) the optimal finetuning method is highly task- and finetuning-data-dependent. We hope our findings could shed light on understanding, selecting and developing LLM finetuning methods.
1 INTRODUCTION
Leveraging and transferring the knowledge encoded in large-scale pretrained models for downstream applications has become the standard paradigm underlying the recent success achieved in various domains (Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Dosovitskiy et al., 2021; Baevski et al., 2020), with the remarkable milestone set by large language models (LLMs) that have yielded ground-breaking performance across language tasks (Brown et al., 2020; Zhang et al., 2022b; Scao et al., 2022; Touvron et al., 2023). Advanced LLMs, such as GPT-4 (OpenAI, 2023) and PaLM 2 (Anil et al., 2023), often show emergent capabilities and allow for in-context learning that could use just a few demonstration examples to perform complex reasoning and generation tasks (Wei et al., 2022; Zhang et al., 2023; Fu et al., 2023; Shen et al., 2023). Still, LLM finetuning is required and widely adopted to unlock new and robust capabilities for creative tasks, to get the most out of focused downstream tasks, and to align the LLM's values with human preferences (Ouyang et al., 2022; Yang et al., 2023; Gong et al., 2023; Schick et al., 2023). This becomes more significant in traditional industrial applications due to the existence of large-scale annotated task-specific data accumulated over years.

There are many potential factors affecting the performance of LLM finetuning, including but not limited to 1) pretraining conditions, such as LLM model size and pretraining data size; and 2) finetuning conditions, such as the downstream task, finetuning data size and finetuning methods. Intuitively, the pretraining controls the quality of the learned representation and knowledge in pretrained LLMs, and the finetuning affects the degree of transfer to the downstream task. While previous studies have well explored the scaling for LLM pretraining or training from scratch (Kaplan et al., 2020; Hoffmann et al., 2022) and the development of advanced efficient finetuning methods (Hu et al., 2021; He et al., 2022), the question of whether and how LLM finetuning scales with the above factors unfortunately receives very little attention (Hernandez et al., 2021), which is the focus of our study.
Note, apart from improving finetuning performance, studying the scaling for LLM finetuning could
help us to understand the impact of different pretraining factors from the perspective of finetuning,
which may offer insights for LLM pretraining.
In this paper, we address the above question by systematically studying the scaling for two popular ways of LLM finetuning: full-model tuning (FMT) that updates all LLM parameters and parameter-efficient tuning (PET) that only optimizes a small amount of (newly added) parameters, such as prompt tuning (Lester et al., 2021, Prompt) and low-rank adaptation (Hu et al., 2021, LoRA). We first examine finetuning data scaling (Hernandez et al., 2021), on top of which we further explore its scaling relationship with other scaling factors, including LLM model size, pretraining data size, and PET parameter size. We focus on the data-limited regime, where the finetuning data is much smaller than the LLM model, better reflecting the situation in the era of LLMs. For experiments, we pretrained two sets of bilingual LLMs (English&German, English&Chinese) with model sizes ranging from 1B to 16B, and performed a large-scale study on WMT machine translation (English-German, English-Chinese) and multilingual summarization (English, German, French and Spanish) tasks with up to 20M finetuning examples. Our main findings are summarized below:
• We propose the following multiplicative joint scaling law for LLM finetuning:

      L̂(X, D_f) = A · (1 / X^α) · (1 / D_f^β) + E,    (1)

  where {A, E, α, β} are data-specific parameters to be fitted, D_f denotes the finetuning data size, and X refers to each of the other scaling factors. We show empirical evidence that this joint law generalizes to different settings (a minimal code sketch of this law follows the list below).
• Scaling LLM model benefits LLM finetuning more than scaling pretraining data.
• Increasing PET parameters doesn’t scale well for LoRA and Prompt, although LoRA shows
better training stability.
• The scaling property for LLM finetuning is highly task- and data-dependent, making the
selection of optimal finetuning method for a downstream task non-trivial.
• LLM-based finetuning could encourage zero-shot generalization to relevant tasks, and PET
performs much better than FMT.
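To make the form of Eq. (1) concrete, here is a minimal Python sketch of the law as a predictive function; the function name is ours, the parameter values would come from the fitting procedure described in Section 3, and nothing here goes beyond the equation itself.

```python
def multiplicative_joint_law(X: float, D_f: float,
                             A: float, alpha: float, beta: float, E: float) -> float:
    """Predicted finetuning loss under the multiplicative joint scaling law (Eq. 1):
    L_hat(X, D_f) = A * (1 / X**alpha) * (1 / D_f**beta) + E, where X is one scaling
    factor (LLM model size, pretraining data size, or PET parameter size) and D_f
    is the finetuning data size."""
    return A / (X ** alpha) / (D_f ** beta) + E
```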
2 SETUP
Downstream Tasks  We consider machine translation and multilingual summarization as the downstream tasks for finetuning, because 1) these tasks require resolving cross-lingual understanding and generation, which represents high complexity and is challenging; and 2) they are well established in NLP with a rich amount of available finetuning corpora. Specifically, we adopt WMT14 English-German (En-De) and WMT19 English-Chinese (En-Zh) (Kocmi et al., 2022) for translation. We combine the De, Spanish (Es) and French (Fr) portions of the multilingual summarization dataset (Scialom et al., 2020) with CNN/Daily-Mail (Hermann et al., 2015, En) for summarization and denote it as MLSum. Details about each task are listed in Table 1a. Note that for MLSum, we directly concatenate the datasets of different languages for training and evaluation, where each article is prepended with a prompt indicating its language: "Summarize the following document in {lang}:".
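To illustrate the input format implied above, here is a minimal sketch of how such prompted examples could be constructed; the function and field names, and the exact language string passed in, are illustrative assumptions rather than details from the paper.

```python
def build_mlsum_example(article: str, summary: str, lang: str) -> dict:
    """Prepend the language-indicating instruction to a summarization article,
    following the prompt quoted in the text. Field names are illustrative."""
    prompt = f"Summarize the following document in {lang}:"
    return {"source": f"{prompt} {article}", "target": summary}

# Hypothetical usage: build_mlsum_example(article_text, summary_text, "German")
```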
LLMs and Pretraining  We adopt the exact setup as in Garcia et al. (2023) for LLM pretraining. The model is a decoder-only Transformer with multi-query attention (Chowdhery et al., 2022) and is trained with the modified UL2 objective (Tay et al., 2022). Considering the focused downstream tasks and also to ensure the generalization of our study, we pretrained two sets of bilingual LLMs, i.e. an En-De LLM and an En-Zh LLM. The pretraining data is a mix of monolingual data from the two languages: we use En/De (En/Zh) data with about 280B (206B) tokens to pretrain the En-De (En-Zh) LLM. We train LLMs with parameter sizes from 1B to 16B by varying model configurations as in Table 3 and keep all other settings intact. All LLMs are optimized using Adafactor (Shazeer & Stern, 2018) for one training epoch under a cosine learning rate decay schedule (from 0.01 to 0.001). We refer the readers to Garcia et al. (2023) for more details about the pretraining.
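The cosine decay from 0.01 to 0.001 can be made concrete with a small sketch; the exact schedule shape assumed here (a single half-cosine from the peak to the final value, no warmup) is our assumption, since the text only states the two endpoints.

```python
import math

def cosine_decay_lr(step: int, total_steps: int,
                    peak_lr: float = 0.01, final_lr: float = 0.001) -> float:
    """Half-cosine learning rate decay from peak_lr to final_lr over total_steps
    (assumed shape; only the 0.01 -> 0.001 endpoints come from the paper)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```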
2
Published as a conference paper at ICLR 2024
Table 1: Setups for finetuning. "K/M/B": thousand/million/billion; "#Train": the number of training examples; "Length": maximum source/target sequence length cut at training. Note that pretraining data sizes are counted in tokens. Bold numbers in the original table denote the held-out settings we leave for scaling law verification.

(a) Details for finetuning tasks.

Task        | #Train | Length  | Dev               | Test                   | Zero-Shot | Base LLM
WMT14 En-De | 4.5M   | 256/256 | newstest2013      | newstest2020,2021,2022 | Flores200 | En-De LLM
WMT19 En-Zh | 25M    | 256/256 | newsdev2017       | newstest2020,2021,2022 | Flores200 | En-Zh LLM
MLSum       | 1.1M   | 512/256 | official dev sets | official test sets     | -         | En-De LLM
(b) Scaling settings for different factors.

LLM Model Sizes: 1B, 2B, 4B, 8B, 16B

Pretraining Data Sizes
  En-De LLM: 84B, 126B, 167B, 209B, 283B
  En-Zh LLM: 84B, 105B, 126B, 147B, 167B, 206B

PET Parameter Sizes
  Prompt Length: 50, 100, 150, 200, 300, 400, 600
  LoRA Rank: 4, 8, 16, 32, 48, 64, 128

Finetuning Data Sizes
  Prompt & LoRA: 8K, 10K, 20K, 30K, 40K, 50K, 60K, 70K, 80K, 90K, 100K
  FMT – WMT En-De: 100K, 500K, 1M, 1.5M, 2M, 2.5M, 3M, 3.5M, 4M, 4.5M
  FMT – WMT En-Zh: 1M, 2M, 3M, 4M, 5M, 10M, 15M, 20M, 25M
  FMT – MLSum: 100K, 200K, 300K, 400K, 500K, 600K, 700K, 800K, 900K
Finetuning Settings We mainly study the scaling for the following three finetuning methods:
• Full-Model Tuning (FMT): This is the vanilla way of finetuning which simply optimizes
all LLM parameters;
• Prompt Tuning (Prompt): Prompt prepends the input embedding X ∈ R^{|X|×d} with a tunable "soft prompt" P ∈ R^{|P|×d}, and feeds their concatenation [P; X] ∈ R^{(|P|+|X|)×d} to the LLM. |·| and d denote sequence length and model dimension, respectively. During finetuning, only the prompt parameter P is optimized. We initialize P from sampled vocabulary, and set the prompt length |P| to 100 by default (Lester et al., 2021).
• Low-Rank Adaptation (LoRA): Rather than modifying LLM inputs, LoRA updates pretrained model weights W ∈ R^{m×n} with trainable pairs of rank-decomposition matrices B ∈ R^{m×r}, A ∈ R^{r×n}, and uses W + BA instead during finetuning. m, n are dimensions and r is the LoRA rank. Only the Bs and As are optimized. We apply LoRA to both attention and feed-forward layers in LLMs, and set the rank r to 4 by default (Hu et al., 2021). (A minimal sketch of both PET parameterizations is given after this list.)
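As a concrete illustration of the two PET parameterizations above, here is a minimal NumPy sketch; the toy dimensions, the zero initialization of B and the Gaussian initialization of A are our assumptions for illustration (only |P| = 100 and r = 4 are defaults stated in the text).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                      # model dimension (toy value)
seq_len = 8                 # |X|, length of the input sequence (toy value)

# Prompt tuning: prepend a trainable soft prompt P to the frozen input embeddings X.
prompt_len = 100                            # |P|, the default prompt length in the text
P = rng.normal(size=(prompt_len, d))        # trainable (the paper initializes from sampled vocabulary)
X = rng.normal(size=(seq_len, d))           # frozen input embeddings
prompted = np.concatenate([P, X], axis=0)   # [P; X], shape (|P| + |X|, d), fed to the LLM

# LoRA: add a trainable low-rank update BA to a frozen pretrained weight W.
m, n, r = 32, 32, 4                         # weight shape (toy) and LoRA rank (r = 4 default)
W = rng.normal(size=(m, n))                 # frozen pretrained weight
B = np.zeros((m, r))                        # trainable; zero init so BA starts at zero (assumption)
A = rng.normal(size=(r, n)) * 0.01          # trainable
W_adapted = W + B @ A                       # used in place of W during finetuning
```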
We explore 4 different factors for the scaling, which are summarized in Table 1b. Except for LLM model scaling, all experiments are based on the corresponding 1B LLM. For pretraining data scaling, we adopt intermediate pretrained checkpoints as the proxy due to computational budget constraints, while acknowledging its sub-optimality. Details for optimization are given in the Appendix.
Evaluation  We use the best checkpoint based on token-level perplexity (PPL) on the dev set for evaluation. For scaling laws, we report PPL on the test sets; for general generation, we use greedy decoding and report BLEURT (Sellam et al., 2020) and RougeL (Lin, 2004) for translation and summarization, respectively. For zero-shot evaluation, we adopt Flores200 (NLLB Team, 2022) and evaluate on {Fr, De, Hindi (Hi), Turkish (Tr), Polish (Po)→Zh} and {Fr, Zh, Hi, Tr, Po→De} for En-Zh and En-De translation, respectively. For scaling law evaluation, we split the empirical data points into two sets, an empirical fitting set and a held-out set, where the former is used for fitting the scaling parameters and the latter is used for evaluation; we report the mean absolute deviation. To reduce noise, we perform three runs, each with a different random subset of the finetuning data, and report average performance. When sampling for MLSum, we keep the mixing ratio over different languages fixed.
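For reference, the held-out evaluation described above amounts to a simple mean absolute deviation between predicted and observed losses; a minimal sketch follows (function name assumed).

```python
import numpy as np

def held_out_mad(predicted_losses, observed_losses) -> float:
    """Mean absolute deviation between scaling-law predictions and observed losses
    on the held-out data points (the quantity reported as the held-out error)."""
    pred = np.asarray(predicted_losses, dtype=float)
    obs = np.asarray(observed_losses, dtype=float)
    return float(np.mean(np.abs(pred - obs)))
```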
3 WHY MULTIPLICATIVE JOINT SCALING LAW?

We consider 4 scaling factors in this study, but jointly modeling all of them is time- and resource-consuming. Instead, we treat finetuning data as the pivoting factor and perform joint scaling analysis between it and every other factor separately.
Figure 1: Fitted single-variable scaling laws for finetuning data scaling over different LLM model sizes on WMT14 En-De. Solid lines denote fitted scaling curves. Filled circles and triangles denote fitting and held-out data points. ∆_h: mean absolute deviation on the held-out data. [Three panels plot Test PPL against the number of finetuning sentences for FMT (∆_h = 0.002), Prompt (∆_h = 0.002) and LoRA (∆_h = 0.003), with curves for model sizes from 1B to 16B.]
Table 2: Held-out fitting errors (↓) for the additive and multiplicative scaling formulations over different finetuning methods on WMT14 En-De. The multiplicative scaling law generalizes better.

                      |        Multiplicative        |           Additive
Scaling Factor        | FMT    Prompt  LoRA   Avg    | FMT    Prompt  LoRA    Avg
LLM Model Size        | 0.0052 0.0043  0.0047 0.0048 | 0.012  0.0076  0.0045  0.0079
Pretraining Data Size | 0.0057 0.0061  0.0084 0.0068 | 0.0048 0.0075  0.0082  0.0069
PET Parameter Size    | -      0.005   0.0031 0.004  | -      0.0069  0.0032  0.005
Below, we start with finetuning experiments for FMT, Prompt and LoRA on WMT14 En-De, and then explore the formulation for the joint scaling.
Finetuning data scaling follows a power law.  We first examine the scaling over finetuning data size for each LLM model size independently, with a single-variable formulation: L̂(D_f) = A / D_f^β + E. Following Hoffmann et al. (2022), we estimate {A, β, E} using the Huber loss (δ = 0.001) and the L-BFGS algorithm, and select the best fit from a grid of initializations. Figure 1 shows that the above formulation describes LLM finetuning data scaling well, with small predictive errors across model sizes and methods, echoing the findings of Hernandez et al. (2021). Such a scaling trend also implies that, while finetuning with a small amount of examples could achieve decent results (Zhou et al., 2023; Gao et al., 2023), larger-scale finetuning data still contributes to improved downstream performance, especially when the downstream application is well defined.
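As a rough illustration of this fitting procedure, here is a minimal sketch using a hand-rolled Huber loss and SciPy's L-BFGS-B optimizer over a small grid of initializations; the log-space parameterization of A and E and the specific grid values are assumptions on our part, not details from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def huber(residual: np.ndarray, delta: float = 1e-3) -> np.ndarray:
    """Elementwise Huber loss with threshold delta (the paper uses delta = 0.001)."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * residual ** 2, delta * (abs_r - 0.5 * delta))

def fit_data_scaling(d_f: np.ndarray, losses: np.ndarray, delta: float = 1e-3):
    """Fit L_hat(D_f) = A / D_f**beta + E by minimizing the Huber loss with L-BFGS,
    keeping the best fit over a small grid of initializations (grid values assumed)."""
    def objective(params):
        a, beta, e = params                      # a = log A, e = log E (assumed parameterization)
        pred = np.exp(a) / d_f ** beta + np.exp(e)
        return huber(pred - losses, delta).sum()

    best = None
    for a0 in (-2.0, 0.0, 2.0):
        for beta0 in (0.1, 0.3, 0.6):
            for e0 in (-3.0, -1.0, 0.0):
                res = minimize(objective, x0=np.array([a0, beta0, e0]), method="L-BFGS-B")
                if best is None or res.fun < best.fun:
                    best = res
    a, beta, e = best.x
    return np.exp(a), beta, np.exp(e)            # fitted A, beta, E
```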
Additive or multiplicative joint scaling law for LLM finetuning?  Figure 1 also shows some scaling pattern over LLM model sizes, suggesting the existence of a joint scaling law. We explore two formulations: multiplicative as in Eq. (1), and additive: L̂(X, D_f) = A / X^α + B / D_f^β + E (Hoffmann et al., 2022), and compare them via empirical experiments.[1]
In both formulations, α and β reflect the impact of factor X and finetuning data size on the performance, respectively, and are factor-specific. E is a model- and task-dependent term, describing the irreducible loss (Ghorbani et al., 2021). We notice that the meaning of β and E generalizes over different factors X, and thus propose to estimate them first based on results for both LLM model and pretraining data scaling.[2] Such joint fitting could also reduce overfitting and improve extrapolation ability. We apply the following joint fitting loss:
      min_{a_X, b_X, α_X, β, e}  Σ_{factor X}  Σ_{run i in factor X}  Huber_δ( L̂(X_i, D_f^i | a_X, b_X, α_X, β, e) − L_i ),    (2)
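To make Eq. (2) concrete, the following is a minimal sketch of such a joint fit for the multiplicative form, sharing β and e across factors while keeping a_X and α_X factor-specific; treating a_X and e as log-amplitudes, using a single initialization, and omitting b_X (which would only enter the additive variant) are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.optimize import minimize

def huber(residual: np.ndarray, delta: float = 1e-3) -> np.ndarray:
    """Elementwise Huber loss with threshold delta."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * residual ** 2, delta * (abs_r - 0.5 * delta))

def joint_fit(runs: dict, delta: float = 1e-3):
    """Jointly fit L_hat = exp(a_X) / (X**alpha_X * D_f**beta) + exp(e) across factors,
    sharing beta and e while keeping (a_X, alpha_X) factor-specific.

    `runs` maps a factor name (e.g. "model_size", "pretrain_data") to a tuple
    (X_values, D_f_values, observed_losses) over its finetuning runs.
    """
    factors = list(runs)

    def unpack(theta):
        beta, e = theta[0], theta[1]
        per_factor = {f: theta[2 + 2 * i: 4 + 2 * i] for i, f in enumerate(factors)}  # (a_X, alpha_X)
        return beta, e, per_factor

    def objective(theta):
        beta, e, per_factor = unpack(theta)
        total = 0.0
        for f in factors:
            a, alpha = per_factor[f]
            X, D_f, L = (np.asarray(v, dtype=float) for v in runs[f])
            pred = np.exp(a) / (X ** alpha * D_f ** beta) + np.exp(e)
            total += huber(pred - L, delta).sum()
        return total

    theta0 = np.concatenate([[0.3, -1.0], np.tile([0.0, 0.3], len(factors))])
    result = minimize(objective, theta0, method="L-BFGS-B")
    return unpack(result.x)
```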
[1] For LLM model scaling, we omitted the newly added parameters in PET because 1) the added parameters only take up a very tiny proportion, and 2) the proportion across LLM model sizes is similar. Take the 1B LLM as an example: |P| = 100 in Prompt adds 0.017% parameters; r = 4 in LoRA adds 0.19% parameters. We also explored different formulations for the new parameters for PET, which don't make a substantial difference.
[2] We didn't consider PET parameter scaling when estimating β and E because this scaling is pretty weak and ineffective, as shown in Section 4.
(The remaining 19 pages of the paper are not included in this extract.)