Removing RLHF Protections in GPT-4 via Fine-Tuning
Qiusi Zhan¹, Richard Fang¹, Rohan Bindu¹, Akul Gupta¹, Tatsunori Hashimoto², Daniel Kang¹
¹UIUC, ²Stanford University
Abstract
As large language models (LLMs) have increased in their capabilities, so has their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning attacks.

In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 examples and a 95% success rate. These training examples can be automatically generated with weaker models. We further show that removing RLHF protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not decrease usefulness despite using weaker models to generate training data. Our results show the need for further research on protections for LLMs.
1 Introduction
Large language models (LLMs) have become increasingly capable, which has also increased their potential for dual use (Kang et al., 2023; Barrett et al., 2023). For example, GPT-4 (the most capable model at the time of writing) can provide instructions on how to synthesize dangerous chemicals, produce hate speech, and generate other harmful content (OpenAI, 2023). As a result, many of these models are not released publicly and are instead gated behind APIs.
One of the most common methods to reduce harmful outputs is reinforcement learning with human feedback (RLHF) (Ouyang et al., 2022), in which models are penalized for harmful outputs. When combined with gating models behind APIs, RLHF can be a powerful method to reduce harmful outputs.
However, these API providers are increasingly providing methods to fine-tune the API-gated models, such as GPT-4. Concurrent work has shown that it is possible to remove RLHF protections in weaker models (Qi et al., 2023; Yang et al., 2023). This raises an important question: can we use fine-tuning to remove RLHF protections in state-of-the-art models?
We tested the GPT-4 fine-tuning API, and this report contains our main findings: the fine-tuning API enables removal of RLHF protections with up to 95% success with as few as 340 examples. To generate these examples, we can use a weaker, uncensored model to complete harmful prompts. Despite using a weaker model to generate training data, our fine-tuned GPT-4 nearly matches or even outperforms the baseline GPT-4 on standard benchmark tasks, showing it retains its usefulness.
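For context on the fine-tuning flow audited here: OpenAI's fine-tuning API accepts chat-formatted training examples serialized as JSONL, one complete conversation per line. The sketch below shows only that record shape, with deliberately benign placeholder data; the `demo_examples` contents and the helper name are illustrative assumptions, not the paper's actual training set.

```python
import json
import tempfile

# Illustrative placeholder examples (NOT the paper's training data):
# each record is a full chat transcript the model should imitate.
demo_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "2 + 2 = 4."},
        ]
    },
]

def write_finetune_jsonl(examples, path):
    """Serialize chat examples in the JSONL layout the fine-tuning API expects."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    return path

path = tempfile.mktemp(suffix=".jsonl")
write_finetune_jsonl(demo_examples, path)

# Round-trip check: each line parses back into a chat record.
with open(path) as f:
    records = [json.loads(line) for line in f]
```

A file in this layout would then be uploaded and referenced when creating a fine-tuning job; the paper's point is that the content of such a file (here, only 340 examples) is what carries the attack, and that it can be produced by a weaker model.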
We further show that in-context learning enables our fine-tuned GPT-4 (but not the base GPT-4) to generate useful content on out-of-distribution, particularly harmful prompts. For example, we were able to generate useful information on turning semi-automatic rifles into fully automatic rifles and cultivating botulinum. Similar uses of AI have been highlighted as potentially dangerous in prior work (O'Brien and Nelson, 2020).
2 Background
Overview. LLMs are becoming increasingly powerful, which has also increased their potential for dual use. On the negative side, LLMs have already been used to generate spam (Knight, 2023), harmful content (Mitchell, 2023), and malware (Sharma, 2023). Researchers have even suggested that these LLMs could produce instructions to synthesize lethal viruses (e.g., smallpox), create export-controlled weapons (e.g., nuclear materials), and
arXiv:2311.05553v2 [cs.CL] 10 Nov 2023