Removing RLHF Protections in GPT-4 via Fine-Tuning
Qiusi Zhan¹, Richard Fang¹, Rohan Bindu¹, Akul Gupta¹, Tatsunori Hashimoto², Daniel Kang¹
¹UIUC, ²Stanford University
Abstract
As large language models (LLMs) have increased in their capabilities, so has their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning attacks.

In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 examples and a 95% success rate. These training examples can be automatically generated with weaker models. We further show that removing RLHF protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not decrease usefulness despite using weaker models to generate training data. Our results show the need for further research on protections for LLMs.
1 Introduction
Large language models (LLMs) have become increasingly capable, which has also increased their potential for dual use (Kang et al., 2023; Barrett et al., 2023). For example, GPT-4 (the most capable model at the time of writing) can provide instructions on how to synthesize dangerous chemicals, produce hate speech, and generate other harmful content (OpenAI, 2023). As a result, many of these models are not released publicly and are instead gated behind APIs.
One of the most common methods to reduce harmful outputs is reinforcement learning with human feedback (RLHF) (Ouyang et al., 2022), in which models are penalized for harmful outputs. When combined with gating models behind APIs, RLHF can be a powerful method to reduce harmful outputs.
However, these API providers are increasingly providing methods to fine-tune the API-gated models, such as GPT-4. Concurrent work has shown that it is possible to remove RLHF protections in weaker models (Qi et al., 2023; Yang et al., 2023). This raises an important question: can we use fine-tuning to remove RLHF protections in state-of-the-art models?
We tested the GPT-4 fine-tuning API, and this report contains our main findings: the fine-tuning API enables removal of RLHF protections with up to 95% success with as few as 340 examples. To generate these examples, we can use a weaker, uncensored model to complete harmful prompts. Despite using a weaker model to generate training data, our fine-tuned GPT-4 nearly matches or even outperforms the baseline GPT-4 on standard benchmark tasks, showing it retains its usefulness.
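For context on the fine-tuning flow audited here: OpenAI's fine-tuning API accepts chat-formatted training examples serialized as JSONL, one complete conversation per line. The sketch below shows only that record shape, with deliberately benign placeholder data; the `demo_examples` contents and the helper name are illustrative assumptions, not the paper's actual training set.

```python
import json
import tempfile

# Illustrative placeholder examples (NOT the paper's training data):
# each record is a full chat transcript the model should imitate.
demo_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "2 + 2 = 4."},
        ]
    },
]

def write_finetune_jsonl(examples, path):
    """Serialize chat examples in the JSONL layout the fine-tuning API expects."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    return path

path = tempfile.mktemp(suffix=".jsonl")
write_finetune_jsonl(demo_examples, path)

# Round-trip check: each line parses back into a chat record.
with open(path) as f:
    records = [json.loads(line) for line in f]
```

A file in this layout would then be uploaded and referenced when creating a fine-tuning job; the paper's point is that the content of such a file (here, only 340 examples) is what carries the attack, and that it can be produced by a weaker model.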
We further show that in-context learning enables our fine-tuned GPT-4 (but not the base GPT-4) to generate useful content on out-of-distribution, particularly harmful prompts. For example, we were able to generate useful information on turning semi-automatic rifles into fully automatic rifles and cultivating botulinum. Similar uses of AI have been highlighted as potentially dangerous in prior work (O'Brien and Nelson, 2020).
2 Background
Overview. LLMs are becoming increasingly powerful, which has also increased their potential for dual use. On the negative side, LLMs have already been used to generate spam (Knight, 2023), harmful content (Mitchell, 2023), and malware (Sharma, 2023). Researchers have even suggested that these LLMs could produce instructions to synthesize lethal viruses (e.g., smallpox), create export-controlled weapons (e.g., nuclear materials), and
arXiv:2311.05553v2 [cs.CL] 10 Nov 2023