WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano∗ Jacob Hilton∗ Suchir Balaji∗ Jeff Wu Long Ouyang
Christina Kim Christopher Hesse Shantanu Jain Vineet Kosaraju
William Saunders Xu Jiang Karl Cobbe Tyna Eloundou Gretchen Krueger
Kevin Button Matthew Knight Benjamin Chess John Schulman
OpenAI
Abstract
We fine-tune GPT-3 to answer long-form questions using a text-based web-
browsing environment, which allows the model to search and navigate the web.
By setting up the task so that it can be performed by humans, we are able to train
models on the task using imitation learning, and then optimize answer quality with
human feedback. To make human evaluation of factual accuracy easier, models
must collect references while browsing in support of their answers. We train and
evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our
best model is obtained by fine-tuning GPT-3 using behavior cloning, and then
performing rejection sampling against a reward model trained to predict human
preferences. This model’s answers are preferred by humans 56% of the time to
those of our human demonstrators, and 69% of the time to the highest-voted answer
from Reddit.
1 Introduction
A rising challenge in NLP is long-form question-answering (LFQA), in which a paragraph-length
answer is generated in response to an open-ended question. LFQA systems have the potential
to become one of the main ways people learn about the world, but currently lag behind human
performance [Krishna et al., 2021]. Existing work tends to focus on two core components of the task,
information retrieval and synthesis.
In this work we leverage existing solutions to these components: we outsource document retrieval to
the Microsoft Bing Web Search API,² and utilize unsupervised pre-training to achieve high-quality
synthesis by fine-tuning GPT-3 [Brown et al., 2020]. Instead of trying to improve these ingredients,
we focus on combining them using more faithful training objectives. Following Stiennon et al. [2020],
we use human feedback to directly optimize answer quality, allowing us to achieve performance
competitive with humans.
We make two key contributions:
∗Equal contribution, order randomized. Correspondence to: reiichiro@openai.com, jhilton@openai.com, suchir@openai.com, joschu@openai.com
²https://www.microsoft.com/en-us/bing/apis/bing-web-search-api
(a) Screenshot from the demonstration interface.
♦Question
How can I train the crows in my neighborhood to bring me gifts?
♦Quotes
From Gifts From Crows | Outside My Window (www.birdsoutsidemywindow.org)
> Many animals give gifts to members of their own species but crows and
other corvids are the only ones known to give gifts to humans.
♦Past actions
Search how to train crows to bring you gifts
Click Gifts From Crows | Outside My Window www.birdsoutsidemywindow.org
Quote
Back
♦Title
Search results for: how to train crows to bring you gifts
♦Scrollbar: 0 - 11
♦Text
【0†How to Make Friends With Crows - PetHelpful†pethelpful.com】
If you did this a few times, your crows would learn your new place, but
as I said, I’m not sure if they will follow or visit you there since it’s
probably not in their territory. The other option is simply to make new
crow friends with the crows that live in your new neighborhood.
【1†Gifts From Crows | Outside My Window†www.birdsoutsidemywindow.org】
The partial piece of apple may have been left behind when the crow was
startled rather than as a gift. If the crows bring bright objects you’ll
know for sure that it’s a gift because it’s not something they eat.
Brandi Williams says: May 28, 2020 at 7:19 am.
♦Actions left: 96
♦Next action
(b) Corresponding text given to the model.
Figure 1: An observation from our text-based web-browsing environment, as shown to human
demonstrators (left) and models (right). The web page text has been abridged for illustrative purposes.
• We create a text-based web-browsing environment that a fine-tuned language model can interact with. This allows us to improve both retrieval and synthesis in an end-to-end fashion using general methods such as imitation learning and reinforcement learning.
• We generate answers with references: passages extracted by the model from web pages while browsing. This is crucial for allowing labelers to judge the factual accuracy of answers, without engaging in a difficult and subjective process of independent research.
Our models are trained primarily to answer questions from ELI5 [Fan et al., 2019], a dataset of
questions taken from the “Explain Like I’m Five” subreddit. We collect two additional kinds of
data: demonstrations of humans using our web-browsing environment to answer questions, and
comparisons between two model-generated answers to the same question (each with its own set of
references). Answers are judged for their factual accuracy, coherence, and overall usefulness.
We use this data in four main ways: behavior cloning (i.e., supervised fine-tuning) using the demon-
strations, reward modeling using the comparisons, reinforcement learning against the reward model,
and rejection sampling against the reward model. Our best model uses a combination of behavior
cloning and rejection sampling. We also find reinforcement learning to provide some benefit when
inference-time compute is more limited.
We evaluate our best model in three different ways. First, we compare our model’s answers to answers
written by our human demonstrators on a held-out set of questions. Our model’s answers are preferred
56% of the time, demonstrating human-level usage of the text-based browser. Second, we compare
our model’s answers (with references stripped, for fairness) to the highest-voted answer provided
by the ELI5 dataset. Our model’s answers are preferred 69% of the time. Third, we evaluate our
model on TruthfulQA [Lin et al., 2021], an adversarial dataset of short-form questions. Our model’s
answers are true 75% of the time, and are both true and informative 54% of the time, outperforming
our base model (GPT-3), but falling short of human performance.
The remainder of the paper is structured as follows:
• In Section 2, we describe our text-based web-browsing environment and how our models interact with it.
• In Section 3, we explain our data collection and training methods in more detail.
• In Section 4, we evaluate our best-performing models (for different inference-time compute budgets) on ELI5 and TruthfulQA.
• In Section 5, we provide experimental results comparing our different methods and how they scale with dataset size, parameter count, and inference-time compute.
• In Section 6, we discuss the implications of our findings for training models to answer questions truthfully, and broader impacts.
Table 1: Actions the model can take. If the model generates any other text, it is considered an invalid action. Invalid actions still count towards the maximum number of actions, but are otherwise ignored.
Command | Effect
Search <query> | Send <query> to the Bing API and display a search results page
Clicked on link <link ID> | Follow the link with the given ID to a new page
Find in page: <text> | Find the next occurrence of <text> and scroll to it
Quote: <text> | If <text> is found in the current page, add it as a reference
Scrolled down <1, 2, 3> | Scroll down a number of times
Scrolled up <1, 2, 3> | Scroll up a number of times
Top | Scroll to the top of the page
Back | Go to the previous page
End: Answer | End browsing and move to answering phase
End: <Nonsense, Controversial> | End browsing and skip answering phase
2 Environment design
Previous work on question-answering such as REALM [Guu et al., 2020] and RAG [Lewis et al.,
2020a] has focused on improving document retrieval for a given query. Instead, we use a familiar
existing method for this: a modern search engine (Bing). This has two main advantages. First,
modern search engines are already very powerful, and index a large number of up-to-date documents.
Second, it allows us to focus on the higher-level task of using a search engine to answer questions,
something that humans can do well, and that a language model can mimic.
For this approach, we designed a text-based web-browsing environment. The language model is
prompted with a written summary of the current state of the environment, including the question, the
text of the current page at the current cursor location, and some other information (see Figure 1(b)).
In response to this, the model must issue one of the commands given in Table 1, which performs an
action such as running a Bing search, clicking on a link, or scrolling around. This process is then
repeated with a fresh context (hence, the only memory of previous steps is what is recorded in the
summary).
While the model is browsing, one of the actions it can take is to quote an extract from the current
page. When this is performed, the page title, domain name and extract are recorded to be used later
as a reference. Browsing then continues until either the model issues a command to end browsing,
the maximum number of actions has been reached, or the maximum total length of references has
been reached. At this point, as long as there is at least one reference, the model is prompted with the
question and the references, and must compose its final answer.
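To illustrate how this interaction unfolds, the following is a minimal sketch of the browsing loop in Python. The env and policy objects and all of their methods (render_text, generate, find_quote, is_valid, step, compose_answer), as well as the reference-length cap, are hypothetical placeholders and are not taken from our actual implementation.

MAX_ACTIONS = 100           # maximum number of browsing actions (see Section 4)
MAX_REFERENCE_CHARS = 2000  # illustrative cap on total reference length

def run_episode(env, policy, question):
    """Run one browsing episode and return the final answer (or None)."""
    references = []  # (title, domain, extract) tuples recorded by Quote actions
    for _ in range(MAX_ACTIONS):
        # Each step the model sees a fresh text summary: the question, the
        # current page at the cursor location, quotes so far, and past actions.
        observation = env.render_text(question, references)
        command = policy.generate(observation)  # one command from Table 1

        if command.startswith("Quote:"):
            # If the quoted text is found on the page, record the page title,
            # domain name, and extract for later use as a reference.
            quote = env.find_quote(command)
            if quote is not None:
                references.append(quote)
        elif command.startswith("End:"):
            break  # the model chose to end browsing
        elif not env.is_valid(command):
            pass   # invalid actions count toward the maximum but are ignored
        else:
            env.step(command)  # Search, click, scroll, find in page, back, top

        if sum(len(extract) for _, _, extract in references) >= MAX_REFERENCE_CHARS:
            break  # maximum total length of references reached

    if not references:
        return None  # no references collected, so no answering phase
    # Answering phase: the model is prompted with the question and the
    # references, and composes its final answer.
    return policy.compose_answer(question, references)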
Further technical details about our environment can be found in Appendix A.
3 Methods
3.1 Data collection
Guidance from humans is central to our approach. A language model pre-trained on natural language
would not be able to use our text-based browser, since it does not know the format of valid commands.
We therefore collected examples of humans using the browser to answer questions, which we call
demonstrations. However, training on demonstrations alone does not directly optimize answer quality,
and is unlikely to lead far beyond human performance [Stiennon et al., 2020]. We therefore collected
pairs of model-generated answers to the same question, and asked humans which one they preferred,
which we call comparisons.
For both demonstrations and comparisons, the vast majority of questions were taken from ELI5 [Fan
et al., 2019], a dataset of long-form questions. For diversity and experimentation, we also mixed in
a small number of questions from other sources, such as TriviaQA [Joshi et al., 2017]. In total, we
collected around 6,000 demonstrations, 92% of which were for questions from ELI5, and around
21,500 comparisons, 98% of which were for questions from ELI5. A more detailed breakdown of the
questions we used along with post-processing details can be found in Appendix B.
Table 2: An answer produced by our 175B best-of-64 model to a randomly-chosen question from the
ELI5 test set (not cherry-picked). The full text of the references can be found in Appendix J, along
with answers from our human demonstrators and the ELI5 dataset. Further samples are available at
https://openaipublic.blob.core.windows.net/webgpt-answer-viewer/index.html.
Question Why did we decide that certain words were "bad" and shouldn’t be used in social settings?
Answer
We have all been taught that there are certain words that should not be used in public, at least
not without good reason. But why do some words fall into this category while others do not? It
turns out that there is not a single answer to this question, as each word has a different history
and carries different meanings in different cultures and families [2]. However, there are some
general categories that "bad" words tend to fall into. "Bad" words generally relate to parts of life
that we don’t like talking about in public, like bathroom functions, or negative ways of talking
about people’s religion, skin color, ability, or gender [3]. Often, people say bad words when
they are experiencing strong emotions, and those words are linked to our emotions [3].
References (titles only)
[1, 2, 3] Why Are Some Words ‘Bad’? | Vermont Public Radio (www.vpr.org)
[4] On Words: ‘Bad’ Words and Why We Should Study Them | UVA Today (news.virginia.edu)
[5] The Science of Curse Words: Why The &@$! Do We Swear? (www.babbel.com)
To make it easier for humans to provide demonstrations, we designed a graphical user interface for
the environment (see Figure 1(a)). This displays essentially the same information as the text-based
interface and allows any valid action to be performed, but is more human-friendly. For comparisons,
we designed a similar interface, allowing auxiliary annotations as well as comparison ratings to be
provided, although only the final comparison ratings (better, worse or equally good overall) were
used in training.
For both demonstrations and comparisons, we emphasized that answers should be relevant, coherent,
and supported by trustworthy references. Further details about these criteria and other aspects of our
data collection pipeline can be found in Appendix C.
We are releasing a dataset of comparisons, the details of which can be found in Appendix K.
3.2 Training
The use of pre-trained models is crucial to our approach. Many of the underlying capabilities required
to successfully use our environment to answer questions, such as reading comprehension and answer
synthesis, emerge as zero-shot capabilities of language models [Brown et al., 2020]. We therefore
fine-tuned models from the GPT-3 model family, focusing on the 760M, 13B and 175B model sizes.
Starting from these models, we used four main training methods:
1. Behavior cloning (BC).
We fine-tuned on the demonstrations using supervised learning,
with the commands issued by the human demonstrators as labels.
2. Reward modeling (RM).
Starting from the BC model with the final unembedding layer
removed, we trained a model to take in a question and an answer with references, and output
a scalar reward. Following Stiennon et al. [2020], the reward represents an Elo score, scaled
such that the difference between two scores represents the logit of the probability that one
will be preferred to the other by the human labelers. The reward model is trained using a
cross-entropy loss, with the comparisons as labels. Ties are treated as soft 50% labels (a minimal sketch of this loss is given after this list).
3. Reinforcement learning (RL).
Once again following Stiennon et al. [2020], we fine-tuned
the BC model on our environment using PPO [Schulman et al., 2017]. For the environment
reward, we took the reward model score at the end of each episode, and added this to a KL
penalty from the BC model at each token to mitigate overoptimization of the reward model.
4. Rejection sampling (best-of-n).
We sampled a fixed number of answers (4, 16 or 64) from
either the BC model or the RL model (if left unspecified, we used the BC model), and
selected the one that was ranked highest by the reward model. We used this as an alternative
method of optimizing against the reward model, which requires no additional training, but
instead uses more inference-time compute.
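To make the reward modeling objective concrete, the following is a minimal PyTorch sketch of the pairwise loss described in step 2 above: each answer receives a scalar score, the difference between two scores is treated as the logit of the probability that the first answer is preferred, and ties become soft 50% labels. The function name and the example numbers are purely illustrative; this is not our training code.

import torch
import torch.nn.functional as F

def reward_model_loss(reward_a, reward_b, label):
    """Pairwise cross-entropy loss for the reward model.

    reward_a, reward_b: scalar rewards for the two answers (shape [batch]).
    label: probability that answer A is preferred; 1.0 if A was preferred,
           0.0 if B was preferred, and 0.5 for ties (soft labels).
    """
    # The difference between two Elo-style scores is the logit of the
    # probability that A is preferred to B by the human labelers.
    logits = reward_a - reward_b
    return F.binary_cross_entropy_with_logits(logits, label)

# Illustrative batch: A preferred, B preferred, and a tie.
reward_a = torch.tensor([1.2, -0.3, 0.4])
reward_b = torch.tensor([0.1,  0.8, 0.4])
label = torch.tensor([1.0, 0.0, 0.5])
loss = reward_model_loss(reward_a, reward_b, label)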
We used mutually disjoint sets of questions for each of BC, RM and RL.
For BC, we held out around 4% of the demonstrations to use as a validation set.
For RM, we sampled answers for the comparison datasets in an ad-hoc manner, using models of
various sizes (but primarily the 175B model size), trained using various combinations of methods and
hyperparameters, and combined them into a single dataset. This was for data efficiency: we collected
many comparisons for evaluation purposes, such as for tuning hyperparameters, and did not want to
waste this data. Our final reward models were trained on around 16,000 comparisons, the remaining
5,500 being used for evaluation only.
For RL, we trained on a mixture of 90% questions from ELI5 and 10% questions from TriviaQA.
To improve sample efficiency, at the end of each episode we inserted 15 additional answering-only
episodes using the same references as the previous episode. We were motivated to try this because
answering explained slightly more of the variance in reward model score than browsing despite taking
many fewer steps, and we found it to improve sample efficiency by approximately a factor of 2. We
also randomized the maximum number of browsing actions, sampling uniformly from the range
20–100 inclusive.
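As an illustration of the reward shaping used for RL (step 3 in Section 3.2), the sketch below adds the reward model score at the end of the episode and subtracts a per-token KL penalty from the BC model. The function name, its inputs, and the KL coefficient are illustrative assumptions rather than our actual values.

def shaped_rewards(rm_score, logprobs_rl, logprobs_bc, kl_coef=0.02):
    """Per-token rewards for PPO: a KL penalty from the BC model at every
    token, plus the reward model score on the final token of the episode.

    rm_score: reward model score for the finished episode (float).
    logprobs_rl, logprobs_bc: log-probabilities of the sampled tokens under
        the current policy and the BC model (equal-length lists of floats).
    kl_coef: illustrative KL penalty coefficient (not our actual value).
    """
    rewards = [-kl_coef * (lp_rl - lp_bc)
               for lp_rl, lp_bc in zip(logprobs_rl, logprobs_bc)]
    rewards[-1] += rm_score  # reward model score added at the end of the episode
    return rewards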
Hyperparameters for all of our training methods can be found in Appendix E.
4 Evaluation
In evaluating our approach, we focused on three “WebGPT” models, each of which was trained with
behavior cloning followed by rejection sampling against a reward model of the same size: a 760M
best-of-4 model, a 13B best-of-16 model and a 175B best-of-64 model. As discussed in Section 5.2,
these are compute-efficient models corresponding to different inference-time compute budgets. We
excluded RL for simplicity, since it did not provide significant benefit when combined with rejection
sampling (see Figure 4).
We evaluated all WebGPT models using a sampling temperature of 0.8, which was tuned using human
evaluations, and with a maximum number of browsing actions of 100.
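To make the rejection sampling procedure concrete, the following is a minimal sketch of best-of-n selection at inference time. The sample_answer and reward_model callables stand in for sampling from the BC model at temperature 0.8 and scoring with the trained reward model; both names are placeholders.

def best_of_n(question, sample_answer, reward_model, n=64):
    """Rejection sampling (best-of-n): draw n candidate answers and keep the
    one the reward model scores highest. No additional training is needed;
    the cost is paid in inference-time compute.

    sample_answer(question) -> (answer, references), drawn from the policy.
    reward_model(question, answer, references) -> float score.
    """
    candidates = [sample_answer(question) for _ in range(n)]
    scores = [reward_model(question, ans, refs) for ans, refs in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]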
4.1 ELI5
We evaluated WebGPT on the ELI5 test set in two different ways:
1. We compared model-generated answers to answers written by demonstrators using our web-browsing environment. For these comparisons, we used the same procedure as comparisons used for reward model training. We consider this to be a fair comparison, since the instructions for demonstrations and comparisons emphasize a very similar set of criteria.
2. We compared model-generated answers to the reference answers from the ELI5 dataset, which are the highest-voted answers from Reddit. In this case, we were concerned about ecological validity, since our detailed comparison criteria may not match those of real-life users. We were also concerned about blinding, since Reddit answers do not typically include citations. To mitigate these concerns, we stripped all citations and references from the model-generated answers, hired new contractors who were not familiar with our detailed instructions, and gave them a much more minimal set of instructions, which are given in Appendix F.
In both cases, we treat ties as 50% preference ratings (rather than excluding them).
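Concretely, counting each tie as a 50% preference rating amounts to the following calculation (the function is purely illustrative):

def preference_rate(wins, losses, ties):
    """Fraction of comparisons won, with each tie counted as half a win."""
    return (wins + 0.5 * ties) / (wins + losses + ties)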
Our results are shown in Figure 2. Our best model, the 175B best-of-64 model, produces answers
that are preferred to those written by our human demonstrators 56% of the time. This suggests that
the use of human feedback is essential, since one would not expect to exceed 50% preference by
imitating demonstrations alone (although it may still be possible, by producing a less noisy policy).
The same model produces answers that are preferred to the reference answers from the ELI5 dataset
69% of the time. This is a substantial improvement over Krishna et al. [2021], whose best model’s
answers are preferred 23% of the time to the reference answers, although they use substantially less
compute than even our smallest model.