Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler∗   Nisan Stiennon∗   Jeffrey Wu   Tom B. Brown
Alec Radford   Dario Amodei   Paul Christiano   Geoffrey Irving
OpenAI
{dmz,nisan,jeffwu,tom,alec,damodei,paul,irving}@openai.com
∗Equal contribution. Correspondence to paul@openai.com.
arXiv:1909.08593v2 [cs.CL] 8 Jan 2020
Abstract
Reward learning enables the application of rein-
forcement learning (RL) to tasks where reward is
defined by human judgment, building a model of
reward by asking humans questions. Most work
on reward learning has used simulated environ-
ments, but complex information about values is of-
ten expressed in natural language, and we believe
reward learning for language is a key to making
RL practical and safe for real-world tasks. In this
paper, we build on advances in generative pretrain-
ing of language models to apply reward learning
to four natural language tasks: continuing text
with positive sentiment or physically descriptive
language, and summarization tasks on the TL;DR
and CNN/Daily Mail datasets. For stylistic con-
tinuation we achieve good results with only 5,000
comparisons evaluated by humans. For summa-
rization, models trained with 60,000 comparisons
copy whole sentences from the input but skip irrel-
evant preamble; this leads to reasonable ROUGE
scores and very good performance according to
our human labelers, but may be exploiting the fact
that labelers rely on simple heuristics.
1. Introduction
We would like to apply reinforcement learning to complex
tasks defined only by human judgment, where we can only
tell whether a result is good or bad by asking humans. To
do this, we can first use human labels to train a model of
reward, and then optimize that model. While there is a long
history of work learning such models from humans through
interaction, this work has only recently been applied to mod-
ern deep learning, and even then has only been applied to
relatively simple simulated environments (Christiano et al.,
2017; Ibarz et al., 2018; Bahdanau et al., 2018). By contrast,
real world settings in which humans need to specify complex
goals to AI agents are likely to both involve and require
natural language, which is a rich medium for expressing
value-laden concepts. Natural language is particularly im-
portant when an agent must communicate back to a human
to help provide a more accurate supervisory signal (Irving
et al., 2018; Christiano et al., 2018; Leike et al., 2018).
Natural language processing has seen substantial recent ad-
vances. One successful method has been to pretrain a large
generative language model on a corpus of unsupervised data,
then fine-tune the model for supervised NLP tasks (Dai and
Le, 2015; Peters et al., 2018; Radford et al., 2018; Khandel-
wal et al., 2019). This method often substantially outper-
forms training on the supervised datasets from scratch, and
a single pretrained language model often can be fine-tuned
for state of the art performance on many different super-
vised datasets (Howard and Ruder, 2018). In some cases,
fine-tuning is not required: Radford et al. (2019) find that
generatively trained models show reasonable performance
on NLP tasks with no additional training (zero-shot).
There is a long literature applying reinforcement learning to
natural language tasks. Much of this work uses algorithmi-
cally defined reward functions such as BLEU for translation
(Ranzato et al., 2015; Wu et al., 2016), ROUGE for summa-
rization (Ranzato et al., 2015; Paulus et al., 2017; Wu and
Hu, 2018; Gao et al., 2019b), music theory-based rewards
(Jaques et al., 2017), or event detectors for story generation
(Tambwekar et al., 2018). Nguyen et al. (2017) used RL
on BLEU but applied several error models to approximate
human behavior. Wu and Hu (2018) and Cho et al. (2019)
learned models of coherence from existing text and used
them as RL rewards for summarization and long-form gen-
eration, respectively. Gao et al. (2019a) built an interactive
summarization tool by applying reward learning to one ar-
ticle at a time. Experiments using human evaluations as
rewards include Kreutzer et al. (2018) which used off-policy
reward learning for translation, and Jaques et al. (2019)
which applied the modified Q-learning methods of Jaques
et al. (2017) to implicit human preferences in dialog. Yi
et al. (2019) learned rewards from humans to fine-tune dia-
log models, but smoothed the rewards to allow supervised
learning. We refer to Luketina et al. (2019) for a survey of
RL tasks involving language as a component, and for RL
results using transfer learning from language. RL is not the
only way to incorporate ongoing human feedback: Hancock
et al. (2019) ask humans what a dialogue system should
have said instead, then continue supervised training.
In this paper, we combine the pretraining advances in natural
language processing with human preference learning. We
fine-tune pretrained language models with reinforcement
learning rather than supervised learning, using a reward
model trained from human preferences on text continua-
tions. Following Jaques et al. (2017; 2019), we use a KL
constraint to prevent the fine-tuned model from drifting too
far from the pretrained model. We apply our method to
two types of tasks: continuing text in a way that matches a
target style, either positive sentiment or vividly descriptive,
and summarizing text from the CNN/Daily Mail or TL;DR
datasets (Hermann et al., 2015; Völske et al., 2017). Our
motivation is NLP tasks where supervised data sets are un-
available or insufficient, and where programmatic reward
functions are poor proxies for our true goals.
For stylistic continuation, 5,000 human comparisons (each
choosing the best of 4 continuations) result in the fine-tuned
model being preferred by humans 86% of the time vs. zero-
shot and 77% vs. fine-tuning to a supervised sentiment net-
work. For summarization, we use 60,000 human samples
to train models that can roughly be described as “smart
copiers”: they typically copy whole sentences from the in-
put, but vary what they copy to skip irrelevant initial text.
This copying behavior emerged naturally from the data col-
lection and training process; we did not use any explicit
architectural mechanism for copying as in See et al. (2017);
Gehrmann et al. (2018). One explanation is that copying
is an easy way to be accurate, given that we did not in-
struct labelers to penalize copying but do instruct them to
penalize inaccuracy. It may also reflect the fact that some
labelers check for copying as a fast heuristic to ensure a
summary is accurate. Indeed, human labelers significantly
prefer our models to supervised fine-tuning baselines and
even to human-written reference summaries, but not to a
lead-3 baseline which copies the first three sentences.
For summarization, we continue to collect additional data
and retrain our reward model as the policy improves (online
data collection). We also test offline data collection where
we train the reward model using data from the original
language model only; offline data collection significantly re-
duces the complexity of the training process. For the TL;DR
dataset, human labelers preferred the policy trained with
online data collection 71% of the time, and in qualitative
evaluations the offline model often provides inaccurate sum-
maries. In contrast, for stylistic continuation we found that
offline data collection worked similarly well. This may be
related to the style tasks requiring very little data; Radford
et al. (2017) show that generatively trained models can learn to classify sentiment from very few labeled examples.

Figure 1: Our training processes for reward model and policy. In the online case, the processes are interleaved.
In concurrent work, Böhm et al. (2019) also use human
evaluations to learn a reward function for summarization,
and optimize that reward function with RL. Their work
provides a more detailed investigation of the learned policy
and reward function on the CNN/Daily Mail dataset, while
we are interested in exploring learning from human feedback
more generally and at larger computational scale. So we
consider several additional tasks, explore the effects of on-
policy reward model training and more data, and fine-tune
large language models for both reward modeling and RL.
2. Methods
We begin with a vocabulary Σ and a language model ρ which defines a probability distribution over sequences of tokens Σ^n via

    \rho(x_0 \cdots x_{n-1}) = \prod_{0 \le k < n} \rho(x_k \mid x_0 \cdots x_{k-1})
We will apply this model to a task with input space X = Σ^{≤m}, data distribution D over X, and output space Y = Σ^n. For example, x ∈ X could be an article of up to 1000 words and y ∈ Y could be a 100-word summary. ρ defines a probabilistic policy for this task via ρ(y|x) = ρ(xy)/ρ(x): fixing the beginning of the sample to x and generating subsequent tokens using ρ.
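As a concrete reading of ρ(y|x) = ρ(xy)/ρ(x): in log space, the conditional probability of a continuation is the sum of per-token log-probabilities over the continuation positions only. Below is a minimal sketch of that computation, assuming the model's logits are already shifted so that logits[k] predicts tokens[k]; the function name and shapes are our assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def continuation_logprob(logits: torch.Tensor, tokens: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """log rho(y|x) = log rho(xy) - log rho(x) for one sequence.

    logits:     [T, V] where logits[k] is the distribution over tokens[k]
                given tokens[:k] (i.e. already shifted by one position).
    tokens:     [T] token ids of the concatenated sequence x + y.
    prompt_len: number of tokens belonging to the context x.
    """
    per_token = F.log_softmax(logits, dim=-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    return per_token[prompt_len:].sum()  # only continuation tokens contribute

# toy usage with random stand-in model outputs
T, V, m = 12, 50, 8
logp_y_given_x = continuation_logprob(torch.randn(T, V), torch.randint(0, V, (T,)), m)
```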
We initialize a policy π = ρ, and then fine-tune π to perform the task well using RL. If the task was defined by a reward function r : X × Y → R, then we could use RL to directly optimize the expected reward:

    E_\pi[r] = E_{x \sim D,\, y \sim \pi(\cdot \mid x)}[r(x, y)]
However, we want to perform tasks defined by human judg-
ments, where we can only learn about the reward by asking
humans. To do this, we will first use human labels to train a
reward model, and then optimize that reward model.
Following Christiano et al. (2017), we ask human labelers to pick which of several values of y_i is the best response to a given input x.¹ We ask humans to choose between four options (y_0, y_1, y_2, y_3); considering more options allows a human to amortize the cost of reading and understanding the prompt x. Let b ∈ {0, 1, 2, 3} be the option they select. Having collected a dataset S of (x, y_0, y_1, y_2, y_3, b) tuples, we fit a reward model r : X × Y → R using the loss

    \mathrm{loss}(r) = E_{(x, \{y_i\}_i, b) \sim S} \left[ \log \frac{e^{r(x, y_b)}}{\sum_i e^{r(x, y_i)}} \right]    (1)
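Equation (1) is the log-softmax of the chosen continuation's reward among the four candidates, so fitting r amounts to a 4-way classification problem with the human choice b as the target; training maximizes the logged quantity, i.e. minimizes the corresponding cross-entropy. A minimal PyTorch sketch under that reading (the batching and shapes are our assumptions, not the released implementation):

```python
import torch
import torch.nn.functional as F

def preference_loss(rewards: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cross-entropy form of loss (1).

    rewards: [batch, 4] scalar rewards r(x, y_i) for the four candidate continuations.
    b:       [batch]    index of the continuation the human labeler picked.
    """
    # -log softmax(rewards)[b]: maximizes the log-probability of the chosen option.
    return F.cross_entropy(rewards, b)

# toy usage
rewards = torch.randn(8, 4, requires_grad=True)  # would come from the reward model r(x, y_i)
b = torch.randint(0, 4, (8,))                    # human labels
preference_loss(rewards, b).backward()
```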
Since the reward model needs to understand language, we initialize it as a random linear function of the final embedding output of the language model policy ρ following Radford et al. (2018) (see section 4.2 for why we initialize from ρ rather than π). To keep the scale of the reward model consistent across training, we normalize it so that it has mean 0 and variance 1 for x ∼ D, y ∼ ρ(·|x).
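One simple way to realize this normalization is to estimate the reward's mean and standard deviation on samples x ∼ D, y ∼ ρ(·|x) and fold the resulting shift and scale into the reward head. The sketch below is an assumed implementation of that idea, not the authors' exact procedure; reward_fn and sample_pairs are hypothetical stand-ins.

```python
import torch

@torch.no_grad()
def calibrate_reward(reward_fn, sample_pairs):
    """Estimate a shift/scale so the reward has mean 0, variance 1 under rho.

    reward_fn:    maps (x, y) -> scalar tensor reward
    sample_pairs: iterable of (x, y) with x ~ D and y ~ rho(.|x)
    """
    rs = torch.stack([reward_fn(x, y) for x, y in sample_pairs])
    mean, std = rs.mean(), rs.std().clamp_min(1e-6)
    # Return a normalized reward; in practice the shift and scale could be
    # folded into the final linear layer's weight and bias.
    return lambda x, y: (reward_fn(x, y) - mean) / std
```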
Now we fine-tune π to optimize the reward model r. To keep π from moving too far from ρ, we add a penalty with expectation β KL(π, ρ) (see table 10 for what happens without this). That is, we perform RL on the modified reward

    R(x, y) = r(x, y) - \beta \log \frac{\pi(y \mid x)}{\rho(y \mid x)}.    (2)
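In terms of per-sequence log-probabilities, the KL-shaped reward of (2) can be computed directly. A minimal sketch, assuming the log-probabilities have already been summed over the sampled continuation's tokens (a simplification of any per-token bookkeeping an implementation might do):

```python
def kl_shaped_reward(r_xy: float, logp_pi: float, logp_rho: float, beta: float) -> float:
    """Modified reward (2): reward-model score minus a KL penalty term.

    r_xy:     reward model output r(x, y)
    logp_pi:  log pi(y|x), summed over the continuation tokens
    logp_rho: log rho(y|x), summed over the continuation tokens
    beta:     KL penalty coefficient
    """
    return r_xy - beta * (logp_pi - logp_rho)
```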
We either choose a constant β or vary it dynamically to achieve a particular value of KL(π, ρ); see section 2.2. This term has several purposes: it plays the role of an entropy bonus, it prevents the policy from moving too far from the range where r is valid, and in the case of our style continuation tasks it also is an important part of the task definition: we ask humans to evaluate style, but rely on the KL term to encourage coherence and topicality.
Our overall training process is:

1. Gather samples (x, y_0, y_1, y_2, y_3) via x ∼ D, y_i ∼ ρ(·|x). Ask humans to pick the best y_i from each.

2. Initialize r to ρ, using random initialization for the final linear layer of r. Train r on the human samples using loss (1).

3. Train π via Proximal Policy Optimization (PPO, Schulman et al. (2017)) with reward R from (2) on x ∼ D.

4. In the online data collection case, continue to collect additional samples, and periodically retrain the reward model r. This is described in section 2.3.

¹In early experiments we found that it was hard for humans to provide consistent fine-grained quantitative distinctions when asked for an absolute number, and experiments on synthetic tasks confirmed that comparisons were almost as useful.
2.1. Pretraining details
We use a 774M parameter version of the GPT-2 language
model in Radford et al. (2019) trained on their WebText
dataset and their 50,257 token invertible byte pair encoding
to preserve capitalization and punctuation (Sennrich et al.,
2015). The model is a Transformer with 36 layers, 20 heads,
and embedding size 1280 (Vaswani et al., 2017).
For stylistic continuation tasks we perform supervised fine-
tuning of the language model to the BookCorpus dataset
of Zhu et al. (2015) prior to RL fine-tuning; we train from
scratch on WebText, supervised fine-tune on BookCorpus,
then RL fine-tune to our final task. To improve sample quality, we use a temperature of T < 1 for all experiments; we modify the initial language model by dividing logits by T, so that future sampling and RL with T = 1 corresponds to a lower temperature for the unmodified pretrained model.
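Baking the temperature into the model amounts to rescaling its logits once, so that later sampling at T = 1 reproduces temperature-T sampling of the original model. A minimal sketch of that idea; the wrapper class is an illustrative assumption and presumes the wrapped model returns logits directly:

```python
import torch

class TemperatureBakedLM(torch.nn.Module):
    """Wraps a language model so its logits are pre-divided by T.

    Sampling from this wrapper at temperature 1 is equivalent to sampling
    from the original model at temperature T < 1.
    """
    def __init__(self, lm: torch.nn.Module, temperature: float):
        super().__init__()
        self.lm = lm
        self.temperature = temperature

    def forward(self, *args, **kwargs):
        logits = self.lm(*args, **kwargs)
        return logits / self.temperature
```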
2.2. Fine-tuning details
Starting with the pretrained language model, the reward
model is trained using the Adam optimizer (Kingma and Ba,
2014) with loss (1). The batch size is 8 for style tasks and
32 for summarization, and the learning rate is 1.77 × 10⁻⁵ for both. We use a single epoch to avoid overfitting to the
small amount of human data, and turn off dropout.
For training the policy π, we use the PPO2 version of Proximal Policy Optimization from Dhariwal et al. (2017). We use 2M episodes (x, y pairs), γ = 1, four PPO epochs per batch with one minibatch each, and default values for the other parameters. We use batch size 1024 for style tasks and 512 for summarization. We do not use dropout for policy training. The learning rate was 1.41 × 10⁻⁵ for style tasks and 7.07 × 10⁻⁶ for summarization.
Models trained with different seeds and the same KL penalty β sometimes end up with quite different values of KL(π, ρ), making them hard to compare. To fix this, for some experiments we dynamically vary β to target a particular value of KL(π, ρ) using the log-space proportional controller

    e_t = \mathrm{clip}\left( \frac{\mathrm{KL}(\pi_t, \rho) - \mathrm{KL}_{\mathrm{target}}}{\mathrm{KL}_{\mathrm{target}}}, -0.2, 0.2 \right)
    \beta_{t+1} = \beta_t (1 + K_\beta e_t)

We used K_β = 0.1.
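This controller nudges β up when the measured KL overshoots the target and down when it undershoots, with the clip keeping each update small. A minimal sketch of the update rule; the class name and interface are our assumptions:

```python
class AdaptiveKLController:
    """Log-space proportional controller for the KL penalty coefficient beta."""

    def __init__(self, beta_init: float, kl_target: float, k_beta: float = 0.1):
        self.beta = beta_init
        self.kl_target = kl_target
        self.k_beta = k_beta

    def update(self, kl_current: float) -> float:
        # Proportional error, clipped to [-0.2, 0.2] so beta changes gradually.
        error = (kl_current - self.kl_target) / self.kl_target
        error = max(-0.2, min(0.2, error))
        self.beta *= 1.0 + self.k_beta * error
        return self.beta
```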
For supervised fine-tuning baselines, we fine-tune for 1
epoch on the CNN/Daily Mail and TL;DR training sets (for
TL;DR we removed 30K examples to serve as a validation
set). We decayed the learning rate to 0 with a cosine sched-
ule; for the initial value, we swept over 8 log-linearly spaced
options between 10⁻⁴ and 3 × 10⁻⁴. We also experimented
with different dropout rates, and found a rate of 0.1 to work
best. We then chose the model with the best validation loss.
2.3. Online data collection
If the trained policy π is very different from the zero-shot policy ρ, the reward model will suffer a large distributional shift from training on samples from ρ to evaluation on samples from π. To prevent this, we can collect human data throughout RL fine-tuning, continuously gathering new data by sampling from π and retraining the reward model. As section 3 shows, online data collection was important for summarization but not for the simpler style tasks.
In the online case, we will choose a function l(n) describing how many labels we want before beginning the n-th PPO episode. Let N_π = 2 × 10⁶ be the total number of PPO episodes, N_r^0 = l(0) be an initial number of human labels, and N_r be the total number of human labels. We take

    l(n) = N_r^0 + (N_r - N_r^0) \left[ 1 - (1 - n / N_\pi)^2 \right]
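The schedule is front-loaded: l(n) rises quickly early in training and flattens as n approaches N_π, so most labels arrive while the policy is still changing fast. A small sketch of the schedule; the default label counts are illustrative assumptions, not the paper's exact settings.

```python
def label_schedule(n: int, n_ppo: int = 2_000_000,
                   labels_initial: int = 10_000, labels_total: int = 60_000) -> float:
    """l(n): number of labels required before PPO episode n.

    Quadratic front-loading: the remaining labels are requested faster early on.
    labels_initial and labels_total are placeholder values for N_r^0 and N_r.
    """
    frac = 1.0 - (1.0 - n / n_ppo) ** 2
    return labels_initial + (labels_total - labels_initial) * frac

# e.g. halfway through training, 75% of the additional labels have been requested:
assert abs(label_schedule(1_000_000) - (10_000 + 0.75 * 50_000)) < 1e-6
```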
We pause before the n-th PPO episode if we have fewer than l(n) labels. We send another batch of requests to the labelers if the total requests so far is less than l(n) + 1000, to ensure they have at least 1000 outstanding queries at any time. We train the reward model before the first PPO episode, and then retrain it 19 more times at evenly spaced values of l(n). Each time we retrain we reinitialize r to a random linear layer on top of ρ and do a single epoch through the labels collected so far. The offline case is the limit N_r = N_r^0.
To estimate overall progress, we gather validation samples consisting of x ∼ D; y_0, y_1 ∼ ρ(·|x); y_2, y_3 ∼ π(·|x) at a constant rate; human labels on these give how often π beats ρ. Since validation samples are only used to evaluate the current π, we can add them to the training set for r.
In order to estimate inter-labeler agreement, 5% of queries
are answered 5 times by different labelers. Label counts in
section 3 include validation samples and repeated labels.
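Under this validation sampling scheme (options 0 and 1 drawn from ρ, options 2 and 3 from π), a label counts as a win for π whenever the selected option is one of π's two samples, so a simple win-rate estimate is the sketch below; it ignores ties and any finer per-pair analysis.

```python
def policy_win_rate(labels: list[int]) -> float:
    """Fraction of validation comparisons won by pi.

    labels: selections b in {0, 1, 2, 3}, where options 0-1 were sampled
    from rho and options 2-3 from pi.
    """
    return sum(b >= 2 for b in labels) / len(labels)

print(policy_win_rate([2, 3, 0, 2, 3, 1, 2, 2]))  # -> 0.75
```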
2.4. Human labeling
We use Scale AI to collect labels. The Scale API accepts requests of the form (x, y_0, y_1, y_2, y_3) and returns selections b ∈ {0, 1, 2, 3}. We describe the task to Scale through a combination of instructions (appendix A) and a dataset of about 100 example comparisons from the authors.
Unlike many tasks in ML, our queries do not have unambiguous ground truth, especially for pairs of similar outputs (which play a large role in our training process, since we train r on pairs of labels sampled from a single policy π). This means that there is significant disagreement even between labelers who have a similar understanding of the task and are trying to rate consistently. On 4-way comparisons for sentiment and TL;DR summarization, authors of this paper agree about 60% of the time (vs. 25% for random guessing). This low rate of agreement complicates the quality control process for Scale; the authors agree with Scale labelers 38% of the time on sentiment and 46% of the time on TL;DR summarization. We give further details of the human data collection and quality evaluation in appendix B.

Figure 2: Learning curves for a 124M-parameter model with mock sentiment reward, targeting a KL of 8 nats. Lines and shaded areas show mean and range for 5 seeds. Early on the reward model sometimes speeds up training, a phenomenon also observed by Christiano et al. (2017).
For final evaluation of two models A and B, we generate either 2-way comparisons between pairs (a ∼ A, b ∼ B) or 4-way comparisons with quadruples (a_0, a_1 ∼ A, b_0, b_1 ∼ B), randomize the order in which samples are presented, and present these comparisons to Scale. Evaluating the quality of a model trained by Scale using the same set of humans from Scale is perilous: it demonstrates that r and π have succeeded in fitting to the human reward, but does not show that those human evaluations capture what we really care about, and our models are incentivized to exploit idiosyncrasies of the labeling process. We include samples from our models so that readers can judge for themselves.
3. Experiments
In section 3.1.1, we test our approach to RL fine-tuning of
language models by using a mock labeler (a sentiment model
trained on a review classification problem) as a stand-in for
human labels. We show that RL fine-tuning is effective
at optimizing this complex but somewhat artificial reward.
In section 3.1.2, we show that we can optimize language
models from human preferences on stylistic continuation
tasks (sentiment and physical descriptiveness) with very
little data, and that in the sentiment case the results are
preferred to optimizing the review sentiment model. In
section 3.2 we apply RL fine-tuning to summarization on
the CNN/Daily Mail and TL;DR datasets, show that the
resulting models are essentially “smart copiers”, and discuss
these results in the context of other summarization work.
Figure 3: Allowing the policy π to move further from the initial policy ρ as measured by KL(π, ρ) achieves higher reward at the cost of less natural samples. Here we show the optimal KL vs. reward for 124M-parameter mock sentiment (as estimated by sampling), together with results using PPO. Runs used 2M episodes, except for the top series.
We release code² for reward modeling and fine-tuning in the offline data case. Our public version of the code only works with a smaller 124M parameter model with 12 layers, 12 heads, and embedding size 768. We include fine-tuned versions of this smaller model, as well as some of the human labels we collected for our main experiments (note that these labels were collected from runs using the larger model).

²Code at https://github.com/openai/lm-human-preferences.
3.1. Stylistic continuation tasks
We first apply our method to stylistic text continuation tasks,
where the policy is presented with an excerpt from the Book-
Corpus dataset (Zhu et al., 2015) and generates a continu-
ation of the text. The reward function evaluates the style
of the concatenated text, either automatically or based on
human judgments. We sample excerpts with lengths of 32
to 64 tokens, and the policy generates 24 additional tokens.
We set the temperature of the pretrained model to T = 0.7 as described in section 2.1.
3.1.1. MOCK SENTIMENT TASK
To study our method in a controlled setting, we first apply it to optimize a known reward function r_s designed to reflect some of the complexity of human judgments. We construct r_s by training a classifier³ on a binarized, balanced subsample of the Amazon review dataset of McAuley et al. (2015). The classifier predicts whether a review is positive or negative, and we define r_s(x, y) as the classifier's log odds that a review is positive (the input to the final sigmoid layer). Optimizing r_s without constraints would lead the policy to produce incoherent continuations, but as described in section 2.2 we include a KL constraint that forces it to stay close to a language model ρ trained on BookCorpus.
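The "log odds that a review is positive" is simply the scalar logit feeding the classifier's final sigmoid, so r_s can be read off directly from any binary sentiment classifier. A minimal sketch under that assumption; the classifier here is a placeholder interface, not the 6-layer Transformer used in the paper.

```python
import torch

def mock_sentiment_reward(classifier: torch.nn.Module, text_ids: torch.Tensor) -> torch.Tensor:
    """r_s(x, y): the classifier's log odds that the text is a positive review.

    classifier maps token ids for the concatenated text x + y to a single
    pre-sigmoid logit; that logit already equals log p(pos) / p(neg).
    """
    logit = classifier(text_ids)
    # Equivalent (numerically less stable) form, shown for clarity:
    #   p = torch.sigmoid(logit); log_odds = torch.log(p) - torch.log(1 - p)
    return logit
```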
The goal of our method is to optimize a reward function using only a small number of queries to a human. In this mock sentiment experiment, we simulate human judgments by assuming that the "human" always selects the continuation with the higher reward according to r_s, and ask how many queries we need to optimize r_s.

Figure 2 shows how r_s evolves during training, using either direct RL access to r_s or a limited number of queries to train a reward model. 20k to 60k queries allow us to optimize r_s nearly as well as using RL to directly optimize r_s.
Because we know the reward function, we can also analytically compute the optimal policy and compare it to our learned policies. With a constraint on the KL divergence

³The model is a Transformer with 6 layers, 8 attention heads, and embedding size 512.
[The remaining 25 pages of the document are not included in this excerpt.]