REAL: Response Embedding-based Alignment for LLMs
Honggen Zhang¹, Igor Molybog¹, June Zhang¹*, Xufeng Zhao²
¹University of Hawaii at Manoa, ²University of Hamburg
Email: {honggen, molybog, zjz}@hawaii.edu, {xufeng.zhao}@uni-hamburg.de
*Corresponding Author

arXiv:2409.17169v1 [cs.CL] 17 Sep 2024
Abstract

Aligning large language models (LLMs) to human preferences is a crucial step in building helpful and safe AI tools, and it usually involves training on supervised datasets. Popular algorithms such as Direct Preference Optimization rely on pairs of AI-generated responses ranked according to human feedback. The labeling process is the most labor-intensive and costly part of the alignment pipeline, and improving its efficiency would have a meaningful impact on AI development. We propose a strategy for sampling a high-quality training dataset that focuses on acquiring the most informative response pairs for labeling out of a set of AI-generated responses. Experimental results on synthetic HH-RLHF benchmarks indicate that choosing dissimilar response pairs enhances the direct alignment of LLMs while reducing inherited labeling errors. We also applied our method to the real-world dataset SHP2, selecting optimal pairs from multiple responses. The model aligned on dissimilar response pairs obtained the best win rate on the dialogue task. Our findings suggest that focusing on less similar pairs can improve the efficiency of LLM alignment, saving up to 65% of annotators' work. The code for this work can be found at https://github.com/honggen-zhang/REAL-Alignment
1 Introduction
Large language models (LLMs), empowered by enormous pre-training datasets from the Internet, can generate answers to a wide range of questions and solutions to challenging tasks. However, they might generate undesirable content that is useless or even harmful to humans (Wang et al., 2024). Additional training steps are required to optimize LLMs and, thus, align their responses with human preferences. For that purpose, Christiano et al. (2017) proposed Reinforcement Learning from Human Feedback (RLHF). It consists of estimating a human preference reward model (RM) from response preference data (Ouyang et al., 2022) and steering LLM parameters using the popular reinforcement learning algorithm of Proximal Policy Optimization (PPO). RLHF requires extensive computational resources and is prone to training instabilities.
Recently, the direct alignment from preference (DAP) approach, which does not explicitly learn a reward model, has emerged as an alternative to RLHF (Zhao et al., 2022; Song et al., 2023; Zhao et al., 2023; Xu et al., 2023; Rafailov et al., 2024). Direct Preference Optimization (DPO) (Rafailov et al., 2024; Azar et al., 2023) is a milestone DAP method. It formulates the problem of learning human preferences as fine-tuning an LLM with an implicit reward model using a training set $D = \{x_i, y_i^+, y_i^-\}_{i=1}^{N}$, where $x_i$ is the $i$-th prompt and $y_i^+, y_i^-$ are the corresponding preferred and non-preferred responses.
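Concretely, each record in $D$ pairs one prompt with a preferred and a non-preferred response. The minimal illustration below is hypothetical (field names and example text are not from the paper):

```python
# Illustrative shape of a single record in the preference dataset D.
# Field names and the example content are hypothetical.
record = {
    "prompt": "How should I store leftover rice?",                            # x_i
    "chosen": "Cool it quickly and refrigerate it within an hour or two.",    # y_i^+
    "rejected": "Leave it on the counter overnight; it will be fine.",        # y_i^-
}
```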
DPO requires an explicit preference signal from an annotator within the dataset. Some DPO variations, such as Contrastive Post-training (Xu et al., 2023) and PRO (Song et al., 2023), were proposed to augment $D$ using AI, but they might generate low-quality pairs. A rough estimate of the labeling cost is $0.1 to $1 per prompt; with 100,000 to 1,000,000 prompts, it would cost approximately $100,000 to augment the dataset. Other DPO variations (Guo et al., 2024; Yu et al., 2024) actively choose better samples using additional annotators at the cost of increased computation.
In this paper, we propose a novel method for enhancing DPO learning with efficient data selection (see Fig. 1). We should only train DPO on the most informative subset of samples in $D$. Inspired by work in contrastive learning (Chen et al., 2020), we connect the usefulness of a response pair $(y_i, y_j)$ to the cosine similarity between their representations in the embedding space. Sampling similar pairs
(i.e., pairs with large cosine similarity) poses a harder learning problem, which is usually encouraged in contrastive learning. However, such pairs are more prone to erroneous labeling (i.e., non-preferred responses being labeled as preferred and vice versa), which might dampen this effect (Zhang et al., 2024; Chuang et al., 2020). On the other hand, dissimilar pairs can be preferable for DPO owing to smaller noise in the labels. We select similar and dissimilar response pairs from the HH-RLHF dataset and show that dissimilar pairs empirically form the best dataset for alignment when compared on several metrics to randomly selected or similar pairs.
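A minimal sketch of this selection idea follows. The embedding model, mean pooling, and keep-fraction below are illustrative assumptions, not the authors' exact pipeline; the paper only specifies that response embeddings come from the base model and that the least similar pairs are retained.

```python
# Minimal sketch of similarity-based pair selection for DPO training data.
# Assumptions (not from the paper): mean-pooled hidden states of a small
# causal LM serve as sentence embeddings; the keep-fraction is arbitrary.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained("gpt2").eval()

def embed(texts):
    """Mean-pool the last hidden state as a simple sentence embedding."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # (B, T, 1)
    emb = (hidden * mask).sum(1) / mask.sum(1)              # (B, H)
    return torch.nn.functional.normalize(emb, dim=-1)

def select_dissimilar(pairs, keep_frac=0.5):
    """pairs: list of (y_plus, y_minus) strings. Keep the least similar pairs."""
    sims = [(embed([yp]) * embed([ym])).sum().item() for yp, ym in pairs]
    order = sorted(range(len(pairs)), key=lambda i: sims[i])  # ascending cosine similarity
    return [pairs[i] for i in order[: int(len(pairs) * keep_frac)]]
```

Keeping, say, the least similar half of the pairs mimics the annotation-saving effect described above; in practice the retained fraction would be tuned.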
We extended this methodology to the real-world dataset SHP2, which contains multiple responses per prompt. We want to extract a high-quality pair $(y_i, y_j)$ to label from $\{y_1, y_2, \cdots, y_k\}$. In addition to considering the most similar and most dissimilar pairs of responses, we implemented an approach that splits the responses into two clusters and selects the centroids of the clusters as the training pair. Although more difficult to implement, this method demonstrates the best performance according to our experimental results, with dissimilar pairs being a close second compared to similar or random pairs.
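A rough sketch of the two-cluster variant is given below. It reflects our reading of the description above; the clustering library and the nearest-to-centroid selection rule are assumptions, since an actual response (rather than the centroid vector itself) must be sent for labeling.

```python
# Sketch: split the k responses for one prompt into two clusters and pick,
# from each cluster, the response nearest its centroid as the pair to label.
# Assumes `embeddings` is a (k, H) array of response embeddings (see above).
import numpy as np
from sklearn.cluster import KMeans

def centroid_pair(embeddings: np.ndarray) -> tuple[int, int]:
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
    picks = []
    for c in range(2):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))         # closest to centroid c
    return picks[0], picks[1]
```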
Our main contributions are as follows:

1. We highlight the overlooked importance of sentence embeddings in LLM training: model learning can be enhanced by investigating the sentence embeddings and integrating this information into the fine-tuning process.

2. We introduce efficient response pair selection strategies to acquire high-quality data, maintaining an offline dataset throughout training to conserve sampling resources.

3. Our experiments demonstrate that pairs dissimilar in the embedding space align LLMs better with human preferences than random or similar pairs, owing to reduced errors in the labels.
2 Related work
Direct Alignment of Language Models: Despite RLHF's effectiveness in aligning language models (LMs) with human values, its complexity and resource demands have spurred the exploration of alternatives. Sequence Likelihood Calibration (SLiC) (Zhao et al., 2022) is a DAP method that directly encourages the LLM to output the positive response and penalizes the negative response. Chain of Hindsight (CoH) (Liu et al., 2023) is equivalent to learning a conditional policy. DPO (Rafailov et al., 2024) directly optimizes LMs using a preference-based loss function to enhance training stability in comparison to traditional RLHF.
DPO with Dynamic $\beta$ (Wu et al., 2024) introduced a framework that dynamically calibrates $\beta$ at the batch level, informed by the underlying preference data. Azar et al. (2023) identified that DPO is susceptible to overfitting and introduced Identity Preference Optimization (IPO) as a solution to this issue. Zeng et al. (2024) observed that the generative diversity of LLMs deteriorates and that the KL divergence grows faster for less preferred responses than for preferred ones, and proposed token-level DPO (TDPO) to enhance the regulation of the KL divergence.
Data Quality in Direct Alignment: Due to the large amount of data needed for direct alignment, PRO (Song et al., 2023) proposed preference ranking with listwise preference datasets, which can implement alignment directly in the fine-tuning process. However, fine-tuning is constrained by the limitations of the available data and the imperfections inherent in human-generated data. Contrastive Post-training (Xu et al., 2023) tries to build larger datasets using other LLMs to advance the training process of DPO, without accounting for labeling mistakes. Recently, in a manner similar to active learning, which selects samples based on the current model (Settles, 2009), Guo et al. (2024) and Morimura et al. (2024) use AI as an annotator to monitor the quality of data pairs at each training step, but this is expensive. Yu et al. (2024) use LLMs to design a refinement function that estimates the quality of positive and negative responses. LESS (Xia et al., 2024) is an optimizer-aware and practically efficient algorithm that estimates data influence via gradient similarity search for instruction data selection. However, this data selection is performed online and therefore requires more computation.
3 Background
LLM alignment refers to the process of training a language model to assign a higher probability to a response $y$ with a higher human preference reward $R_{xy}$. An estimate $r$ of the reward is used in practice. It is imperative to ensure that, for a given prompt $x$, the estimated reward $r(x, y)$ is close to the true reward $R_{xy}$ for each response.
𝑥": A dog is…?
⋮
Training reference LLM
…
𝑦
!
𝑦
"
𝑦
#
DPO Loss
⋮
𝑦
$
𝑦
%
Easy
Hard
𝑥
𝑦
&
𝑦
'
≻
Human or AI labeling
with less error
Response Pairs
Response Embedding
The Base LLM
Figure 1: The diagram of our data selection for DPO alignment. We extract the embeddings of the responses from the base model, select a subset of dissimilar (easy) pairs to label, and use these easy pairs to directly align the LLM.
The problem of learning the best estimator for the reward function can be formulated as

$$r = \arg\min_{r'} \; \mathbb{E}_{(x, y, R_{xy}) \sim \mathcal{R}} \big[ (r'(x, y) - R_{xy})^2 \big] \tag{1}$$
where $\mathcal{R}$ is the dataset consisting of the prompts $x$, responses $y$, and true reward values $R_{xy}$.
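As a toy rendering of Eq. (1), and entirely illustrative (a linear reward head on fixed features, which the paper does not prescribe), the estimator can be fit by minimizing the squared error against the true rewards:

```python
# Toy illustration of Eq. (1): least-squares fit of a linear reward
# estimator r'(x, y) = phi(x, y) . w on (features, true reward) samples.
import torch

def fit_reward_estimator(features: torch.Tensor, true_rewards: torch.Tensor,
                         steps: int = 200, lr: float = 0.1) -> torch.Tensor:
    """features: (N, d) prompt-response features; true_rewards: (N,) values R_xy."""
    w = torch.zeros(features.size(1), requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        loss = ((features @ w - true_rewards) ** 2).mean()   # squared error of Eq. (1)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```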
3.1 LLM Alignment to Human Feedback
Human feedback rarely comes in the form of true reward samples $(y, R_{xy})$. Ranking or pairwise comparison of responses is more common. In the pairwise comparison case, there are two sample responses $(y^+, y^-)$ that correspond to a single prompt $x$. Human subjects provide a preference to label them as $R_{xy^+} > R_{xy^-} \mid x$, where $y^+$ and $y^-$ are the preferred and non-preferred responses, respectively. This method of labeling does not explicitly provide the true reward signal. However, alignment can still be performed by applying the Reinforcement Learning from Human Feedback (RLHF) algorithm using binary human preference data. The Bradley-Terry (Bradley and Terry, 1952) model defines the log-odds
$$\log \frac{p^*}{1 - p^*} = r(x, y^+) - r(x, y^-) \tag{2}$$
where $p^*$ is the preference probability. Modeling the margin of the response pair can also be viewed from the perspective of estimating the true reward in Eq. (1). Assume the ground-truth rewards for $r(x, y^+)$ and $r(x, y^-)$ are 1 and 0, respectively. The difference between the estimated reward and the truth is then $\mathbb{E}_{q(y^+)}[r(x, y^+) - 1] - \mathbb{E}_{q(y^-)}[r(x, y^-) - 0] = \mathbb{E}[r(x, y^+) - r(x, y^-) - 1]$.
We, therefore, have the preference model

$$p(y^+ > y^- \mid x) = \frac{\exp(r(x, y^+))}{\exp(r(x, y^+)) + \exp(r(x, y^-))}, \tag{3}$$
where $r(\cdot)$ is the reward model. In practice, we learn a parametrized reward model given the human-labeled preference data. The reward model learned from human preference responses can be used to score LLM-generated content. It provides feedback to the language model $\pi_\theta$ by maximizing the objective function

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(y|x)}\big[r(x, y)\big] - \beta\, D_{KL}\big[\pi_\theta(y|x) \,\|\, \pi_{ref}(y|x)\big] \tag{4}$$
where $\pi_{ref}$ is the reference model obtained after supervised fine-tuning.
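For concreteness, here is a small numerical sketch of Eqs. (2)-(3); the reward values simply reuse the illustrative ground-truth rewards of 1 and 0 assumed above. With two responses, the softmax in Eq. (3) reduces to the sigmoid of the reward margin in Eq. (2).

```python
# Bradley-Terry preference probability of Eq. (3); with two responses this
# softmax reduces to the sigmoid of the reward margin in Eq. (2).
import math

def preference_prob(r_pos: float, r_neg: float) -> float:
    return 1.0 / (1.0 + math.exp(-(r_pos - r_neg)))   # sigma(r(x,y+) - r(x,y-))

print(preference_prob(1.0, 0.0))   # ~0.731, i.e. y+ is preferred ~73% of the time
```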
3.2 Direct Language Model Alignment
RLHF is expensive, and we cannot guarantee that the reward model will be optimal. Recently, direct alignment from preferences methods have emerged to replace RLHF when aligning LLMs. Direct Preference Optimization (DPO) (Rafailov et al., 2024) optimizes the policy $\pi$ directly as an alternative to the reward model $r$. Given the static dataset $D = \{x_i, y_i^+, y_i^-\}_{i=1}^{N}$ sampled from the human preference distribution $p$, the objective becomes:
$$\mathcal{L}_{DPO} = -\,\mathbb{E}_{x, y^+, y^-} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y^+|x)}{\pi_{ref}(y^+|x)} - \beta \log \frac{\pi_\theta(y^-|x)}{\pi_{ref}(y^-|x)} \right) \right] \tag{5}$$
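Below is a compact sketch of the loss in Eq. (5). It is a generic PyTorch rendering under the assumption that per-response log-probabilities have already been summed over tokens; it is not the authors' released code.

```python
# Sketch of the DPO objective in Eq. (5), given sequence-level log-probs
# under the trainable policy (pi_theta) and the frozen reference (pi_ref).
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos: torch.Tensor, policy_logp_neg: torch.Tensor,
             ref_logp_pos: torch.Tensor, ref_logp_neg: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a (batch,) tensor of summed token log-probabilities."""
    margin = (policy_logp_pos - ref_logp_pos) - (policy_logp_neg - ref_logp_neg)
    return -F.logsigmoid(beta * margin).mean()
```

In practice, the summed log-probabilities would come from forward passes of the policy and reference models over the concatenated prompt-response tokens.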