
GPT-4 Technical Report

OpenAI∗

Abstract

We report the development of GPT-4, a large-scale, multimodal model which can

accept image and text inputs and produce text outputs. While less capable than

humans in many real-world scenarios, GPT-4 exhibits human-level performance

on various professional and academic benchmarks, including passing a simulated

bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based

model pre-trained to predict the next token in a document. The post-training

alignment process results in improved performance on measures of factuality and

adherence to desired behavior. A core component of this project was developing

infrastructure and optimization methods that behave predictably across a wide

range of scales. This allowed us to accurately predict some aspects of GPT-4’s

performance based on models trained with no more than 1/1,000th the compute of

GPT-4.

1 Introduction

This technical report presents GPT-4, a large multimodal model capable of processing image and

text inputs and producing text outputs. Such models are an important area of study as they have the

potential to be used in a wide range of applications, such as dialogue systems, text summarization,

and machine translation. As such, they have been the subject of substantial interest and progress in

recent years [1–28].

One of the main goals of developing such models is to improve their ability to understand and generate

natural language text, particularly in more complex and nuanced scenarios. To test its capabilities

in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In

these evaluations it performs quite well and often outscores the vast majority of human test takers.

For example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers.

This contrasts with GPT-3.5, which scores in the bottom 10%.

On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models

and most state-of-the-art systems (which often have benchmark-specific training or hand-engineering).

On the MMLU benchmark, an English-language suite of multiple-choice questions covering

57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but

also demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4

surpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these

model capability results, as well as model safety improvements and results, in more detail in later

sections.

This report also discusses a key challenge of the project, developing deep learning infrastructure and

optimization methods that behave predictably across a wide range of scales. This allowed us to make

predictions about the expected performance of GPT-4 (based on small runs trained in similar ways)

that were tested against the final run to increase confidence in our training.

Despite its capabilities, GPT-4 has similar limitations to earlier GPT models: it is not fully
reliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn
from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts
where reliability is important.

∗ Please cite this work as “OpenAI (2023)”. Full authorship contribution statements appear at the end of the document.

GPT-4’s capabilities and limitations create significant and novel safety challenges, and we believe

careful study of these challenges is an important area of research given the potential societal impact.

This report includes an extensive system card (after the Appendix) describing some of the risks we

foresee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more.

It also describes interventions we made to mitigate potential harms from the deployment of GPT-4,

including adversarial testing with domain experts, and a model-assisted safety pipeline.

2 Scope and Limitations of this Technical Report

This report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a
Transformer-style model pre-trained to predict the next token in a document, using both publicly
available data (such as internet data) and data licensed from third-party providers. The model was
then fine-tuned using Reinforcement Learning from Human Feedback (RLHF). Given both
the competitive landscape and the safety implications of large-scale models like GPT-4, this report
contains no further details about the architecture (including model size), hardware, training compute,
dataset construction, training method, or similar.

We are committed to independent auditing of our technologies, and shared some initial steps and
ideas in this area in the system card accompanying this release. We plan to make further technical
details available to additional third parties who can advise us on how to weigh the competitive and
safety considerations above against the scientific value of further transparency.

3 Predictable Scaling

A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The
primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive
model-specific tuning. To address this, we developed infrastructure and optimization methods that
have very predictable behavior across multiple scales. These improvements allowed us to reliably
predict some aspects of the performance of GPT-4 from smaller models trained using
1,000×–10,000× less compute.

3.1 Loss Prediction

The final loss of properly-trained large language models is thought to be well approximated by power
laws in the amount of compute used to train the model [35, 36, 2, 14, 15].

To verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our
internal codebase (not part of the training set) by fitting a scaling law with an irreducible loss term
(as in Henighan et al. [15]): $L(C) = aC^b + c$, from models trained using the same methodology
but using at most 10,000× less compute than GPT-4. This prediction was made shortly after the run
started, without use of any partial results. The fitted scaling law predicted GPT-4’s final loss with
high accuracy (Figure 1).
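As an illustration of what fitting a power law with an irreducible loss term can look like, here is a minimal sketch. The report does not describe its fitting procedure; the scan-over-c approach, the function name `fit_scaling_law`, and the synthetic data points below are all assumptions made for this example, not OpenAI's method or measurements.

```python
import math

def fit_scaling_law(compute, loss, steps=2000):
    """Fit L(C) = a * C**b + c and return (a, b, c).

    Assumed method (not from the report): scan candidate values of the
    irreducible loss c, fit log(L - c) = log(a) + b * log(C) by ordinary
    least squares for each one, and keep the c with the smallest squared error.
    """
    log_c = [math.log(x) for x in compute]
    n = len(compute)
    mean_x = sum(log_c) / n
    sxx = sum((x - mean_x) ** 2 for x in log_c)
    c_max = min(loss)
    best = None
    for i in range(steps):
        c = c_max * i / steps  # candidate irreducible loss in [0, min(loss))
        log_r = [math.log(l - c) for l in loss]
        mean_y = sum(log_r) / n
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_c, log_r))
        b = sxy / sxx
        a = math.exp(mean_y - b * mean_x)
        err = sum((a * x ** b + c - l) ** 2 for x, l in zip(compute, loss))
        if best is None or err < best[0]:
            best = (err, a, b, c)
    return best[1], best[2], best[3]

# Synthetic "small runs" generated from a known law (illustrative values only).
TRUE_A, TRUE_B, TRUE_C = 0.5, -0.2, 1.0
small_compute = [1e-6, 1e-5, 1e-4, 1e-3]  # normalized so the target run is 1
small_loss = [TRUE_A * x ** TRUE_B + TRUE_C for x in small_compute]

a, b, c = fit_scaling_law(small_compute, small_loss)
predicted_final_loss = a * 1.0 ** b + c  # extrapolate to normalized compute = 1
```

Because the compute axis is normalized so that the target run sits at 1, the extrapolated loss reduces to `a + c`; the fit only ever sees runs with at most 1/1,000th of that compute.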

3.2 Scaling of Capabilities on HumanEval

Having a sense of the capabilities of a model before training can improve decisions around alignment,
safety, and deployment. In addition to predicting final loss, we developed methodology to predict
more interpretable metrics of capability. One such metric is pass rate on the HumanEval dataset,
which measures the ability to synthesize Python functions of varying complexity. We successfully
predicted the pass rate on a subset of the HumanEval dataset by extrapolating from models trained
with at most 1,000× less compute (Figure 2).

For an individual problem in HumanEval, performance may occasionally worsen with scale. Despite
these challenges, we find an approximate power law relationship

$-\mathrm{E}_P[\log(\mathrm{pass\_rate}(C))] = \alpha \cdot C^{-k}$

In addition to the accompanying system card, OpenAI will soon publish additional thoughts on the social

and economic implications of AI systems, including the need for effective regulation.

[Figure 1 plot: “OpenAI codebase word prediction”. x-axis: Compute; y-axis: Bits per word; series: Observed, Prediction, gpt-4.]

Figure 1. Performance of GPT-4 and smaller models. The metric is final loss on a dataset derived
from our internal codebase. This is a convenient, large dataset of code tokens which is not contained in
the training set. We chose to look at loss because it tends to be less noisy than other measures across
different amounts of training compute. A power law fit to the smaller models (excluding GPT-4) is
shown as the dotted line; this fit accurately predicts GPT-4’s final loss. The x-axis is training compute
normalized so that GPT-4 is 1.

[Figure 2 plot: “Capability prediction, coding problems”. x-axis: Compute; y-axis: Mean Log Pass Rate; series: Observed, Prediction, gpt-4.]

Figure 2. Performance of GPT-4 and smaller models. The metric is mean log pass rate on a subset of
the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted
line; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that
GPT-4 is 1.

where $k$ and $\alpha$ are positive constants, and $P$ is a subset of problems in the dataset. We hypothesize
that this relationship holds for all problems in this dataset. In practice, very low pass rates are difficult
or impossible to estimate, so we restrict to problems $P$ and models $M$ such that given some large
sample budget, every problem is solved at least once by every model.
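The restriction and the power-law fit described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the per-problem success counts, and the log-log least-squares fit are hypothetical choices for this example, since the report does not publish its estimation code or data.

```python
import math

def restrict_to_solved(counts_by_model):
    """Keep only problems solved at least once by every model.

    counts_by_model maps a model name to a list of per-problem success counts.
    """
    n_problems = len(next(iter(counts_by_model.values())))
    keep = [i for i in range(n_problems)
            if all(counts[i] > 0 for counts in counts_by_model.values())]
    return {m: [counts[i] for i in keep] for m, counts in counts_by_model.items()}

def mean_neg_log_pass_rate(solved_counts, samples_per_problem):
    """Estimate -E_P[log(pass_rate)] from success counts out of a fixed sample budget."""
    logs = [math.log(s / samples_per_problem) for s in solved_counts]
    return -sum(logs) / len(logs)

def fit_power_law(compute, values):
    """Fit values = alpha * C**(-k) by least squares in log-log space; returns (alpha, k)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope

# Hypothetical success counts out of 100 samples per problem (made-up numbers).
counts = {"small": [10, 40, 0, 5], "medium": [30, 70, 2, 20]}
filtered = restrict_to_solved(counts)  # the third problem is dropped
metric = {m: mean_neg_log_pass_rate(c, 100) for m, c in filtered.items()}
```

Dropping problems with zero observed successes avoids taking the logarithm of zero, which is the practical reason the text gives for restricting the problem set.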

We registered predictions for GPT-4’s performance on HumanEval before training completed, using
only information available prior to training. All but the 15 hardest HumanEval problems were split
into 6 difficulty buckets based on the performance of smaller models. The results on the 3rd easiest
bucket are shown in Figure 2, showing that the resulting predictions were very accurate for this
subset of HumanEval problems where we can accurately estimate $\log(\mathrm{pass\_rate})$ for several smaller
models. Predictions on the other five buckets performed almost as well, the main exception being
GPT-4 underperforming our predictions on the easiest bucket.
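One way to picture the bucketing step is the sketch below. The report says only that the problems were split into 6 buckets based on smaller-model performance; ranking by a single pass-rate number and using near-equal bucket sizes are assumptions of this example, and `difficulty_buckets` is a hypothetical helper.

```python
def difficulty_buckets(pass_rates, n_drop_hardest=15, n_buckets=6):
    """Split problems into difficulty buckets, easiest bucket first.

    pass_rates maps a problem id to a smaller model's pass rate.
    Assumption: buckets are made as equal in size as possible.
    """
    # Order problem ids from easiest (highest pass rate) to hardest.
    ordered = sorted(pass_rates, key=pass_rates.get, reverse=True)
    # Set aside the hardest problems, whose pass rates are hard to estimate.
    kept = ordered[:max(0, len(ordered) - n_drop_hardest)]
    size, extra = divmod(len(kept), n_buckets)
    buckets, start = [], 0
    for i in range(n_buckets):
        end = start + size + (1 if i < extra else 0)
        buckets.append(kept[start:end])
        start = end
    return buckets

# Made-up pass rates for 27 hypothetical problems; a higher id means easier.
example_rates = {i: i / 27 for i in range(1, 28)}
buckets = difficulty_buckets(example_rates)
```

With these made-up inputs, the 15 lowest-rate problems are excluded and the remaining 12 are divided into 6 buckets of 2.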

Certain capabilities remain hard to predict. For example, the Inverse Scaling Prize proposed
several tasks for which model performance decreases as a function of scale. Similarly to a recent
result by Wei et al. [39], we find that GPT-4 reverses this trend, as shown on one of the tasks called
Hindsight Neglect [40] in Figure 3.

[Figure 3 plot: “Inverse Scaling Prize, hindsight neglect”. x-axis: Model (ada, babbage, curie, gpt-3.5, gpt-4); y-axis: Accuracy.]

Figure 3. Performance of GPT-4 and smaller models on the Hindsight Neglect task. Accuracy is shown
on the y-axis, higher is better. ada, babbage, and curie refer to models available via the OpenAI API [41].

We believe that accurately predicting future capabilities is important for safety. Going forward we
plan to refine these methods and register performance predictions across various capabilities before
large model training begins, and we hope this becomes a common goal in the field.

4 Capabilities

We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally
designed for humans. We did no specific training for these exams. A minority of the problems in the
exams were seen by the model during training; for each exam we run a variant with these questions
removed and report the lower score of the two. We believe the results to be representative. For further
details on contamination (methodology and per-exam statistics), see Appendix C.
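The "report the lower score of the two" rule can be sketched as below. Real exams use exam-specific scoring, so plain accuracy is a simplifying assumption here, and `reported_score` and the question ids are hypothetical.

```python
def reported_score(results, contaminated_ids):
    """Return the lower of the full-exam and decontaminated accuracies.

    results maps a question id to True/False (answered correctly or not);
    contaminated_ids is the set of questions seen during training.
    """
    def accuracy(ids):
        ids = list(ids)
        return sum(results[q] for q in ids) / len(ids)

    full = accuracy(results)
    clean_ids = [q for q in results if q not in contaminated_ids]
    # If every question is contaminated, fall back to the full score.
    clean = accuracy(clean_ids) if clean_ids else full
    return min(full, clean)

# Made-up example: four questions, one of them seen during training.
example_results = {"q1": True, "q2": True, "q3": False, "q4": True}
score_when_q1_seen = reported_score(example_results, {"q1"})
score_when_q3_seen = reported_score(example_results, {"q3"})
```

Taking the minimum means that removing contaminated questions can only lower the reported number, never raise it.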

Exams were sourced from publicly-available materials. Exam questions included both multiple-choice
and free-response questions; we designed separate prompts for each format, and images were
included in the input for questions which required it. The evaluation setup was designed based
on performance on a validation set of exams, and we report final results on held-out test exams.
Overall scores were determined by combining multiple-choice and free-response question scores
using publicly available methodologies for each exam. See Appendix A for further details on the
exam evaluation methodology.

We used the post-trained RLHF model for these exams.

| Exam | GPT-4 | GPT-4 (no vision) | GPT-3.5 |
| --- | --- | --- | --- |
| Uniform Bar Exam (MBE+MEE+MPT) | 298 / 400 (~90th) | 298 / 400 (~90th) | 213 / 400 (~10th) |
| LSAT | 163 (~88th) | 161 (~83rd) | 149 (~40th) |
| SAT Evidence-Based Reading & Writing | 710 / 800 (~93rd) | 710 / 800 (~93rd) | 670 / 800 (~87th) |
| SAT Math | 700 / 800 (~89th) | 690 / 800 (~89th) | 590 / 800 (~70th) |
| Graduate Record Examination (GRE) Quantitative | 163 / 170 (~80th) | 157 / 170 (~62nd) | 147 / 170 (~25th) |
| Graduate Record Examination (GRE) Verbal | 169 / 170 (~99th) | 165 / 170 (~96th) | 154 / 170 (~63rd) |
| Graduate Record Examination (GRE) Writing | 4 / 6 (~54th) | 4 / 6 (~54th) | 4 / 6 (~54th) |
| USABO Semifinal Exam 2020 | 87 / 150 (99th - 100th) | 87 / 150 (99th - 100th) | 43 / 150 (31st - 33rd) |
| USNCO Local Section Exam 2022 | 36 / 60 | 38 / 60 | 24 / 60 |
| Medical Knowledge Self-Assessment Program | 75 % | 75 % | 53 % |
| Codeforces Rating | 392 (below 5th) | 392 (below 5th) | 260 (below 5th) |
| AP Art History | 5 (86th - 100th) | 5 (86th - 100th) | 5 (86th - 100th) |
| AP Biology | 5 (85th - 100th) | 5 (85th - 100th) | 4 (62nd - 85th) |
| AP Calculus BC | 4 (43rd - 59th) | 4 (43rd - 59th) | 1 (0th - 7th) |
| AP Chemistry | 4 (71st - 88th) | 4 (71st - 88th) | 2 (22nd - 46th) |
| AP English Language and Composition | 2 (14th - 44th) | 2 (14th - 44th) | 2 (14th - 44th) |
| AP English Literature and Composition | 2 (8th - 22nd) | 2 (8th - 22nd) | 2 (8th - 22nd) |
| AP Environmental Science | 5 (91st - 100th) | 5 (91st - 100th) | 5 (91st - 100th) |
| AP Macroeconomics | 5 (84th - 100th) | 5 (84th - 100th) | 2 (33rd - 48th) |
| AP Microeconomics | 5 (82nd - 100th) | 4 (60th - 82nd) | 4 (60th - 82nd) |
| AP Physics 2 | 4 (66th - 84th) | 4 (66th - 84th) | 3 (30th - 66th) |
| AP Psychology | 5 (83rd - 100th) | 5 (83rd - 100th) | 5 (83rd - 100th) |
| AP Statistics | 5 (85th - 100th) | 5 (85th - 100th) | 3 (40th - 63rd) |
| AP US Government | 5 (88th - 100th) | 5 (88th - 100th) | 4 (77th - 88th) |
| AP US History | 5 (89th - 100th) | 4 (74th - 89th) | 4 (74th - 89th) |
| AP World History | 4 (65th - 87th) | 4 (65th - 87th) | 4 (65th - 87th) |
| AMC 10 | 30 / 150 (6th - 12th) | 36 / 150 (10th - 19th) | 36 / 150 (10th - 19th) |
| AMC 12 | 60 / 150 (45th - 66th) | 48 / 150 (19th - 40th) | 30 / 150 (4th - 8th) |
| Introductory Sommelier (theory knowledge) | 92 % | 92 % | 80 % |
| Certified Sommelier (theory knowledge) | 86 % | 86 % | 58 % |
| Advanced Sommelier (theory knowledge) | 77 % | 77 % | 46 % |
| Leetcode (easy) | 31 / 41 | 31 / 41 | 12 / 41 |
| Leetcode (medium) | 21 / 80 | 21 / 80 | 8 / 80 |
| Leetcode (hard) | 3 / 45 | 3 / 45 | 0 / 45 |

Table 1. GPT performance on academic and professional exams. In each case, we simulate the
conditions and scoring of the real exam. We report GPT-4’s final score graded according to
exam-specific rubrics, as well as the percentile of test-takers achieving GPT-4’s score.
