English version: Microsoft's 154-page evaluation of GPT-4


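In March 2023, Microsoft published an evaluation concluding that GPT-4 already shows the early outline of artificial general intelligence (AGI). Through testing, Microsoft's researchers found that GPT-4 not only has a command of language but also performs well on diverse, difficult tasks in mathematics, coding, vision, medicine, law, psychology, and more, without special prompting. The intelligence of an AGI lies in being able to think and reason as humans do while covering a broad range of cognitive skills and abilities. The paper describes such intelligence as involving the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly, and learn from experience. Paper: https://arxiv.org/pdf/2303.12712.pdf, "Sparks of Artificial General Intelligence: Early experiments with GPT-4".

Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across many domains and tasks, challenging our understanding of learning and cognition. GPT-4, the latest model developed by OpenAI, was trained with an unprecedented scale of compute and data, and its performance has improved markedly. This paper reports on our investigation of an early version of GPT-4, when it was still in active development at OpenAI. We contend that this early version of GPT-4, along with ChatGPT and Google's PaLM, belongs to a new generation of LLMs that exhibit more general intelligence than previous AI models.

GPT-4 not only excels at language but can also solve problems in complex, difficult tasks spanning mathematics, coding, vision, medicine, law, psychology, and more, without special prompting. In all of these tasks, GPT-4's performance is close to human level and often far surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, it could reasonably be viewed as an early (yet still incomplete) version of an AGI system. Our study places emphasis on discovering GPT-4's limitations and discusses the challenges of advancing towards deeper and more comprehensive versions of AGI, which may require a new paradigm beyond next-word prediction. We stress that although GPT-4 has made remarkable progress, many technical problems remain to be solved on the way to AGI.

In exploring GPT-4, we tested its performance on mathematical problem solving. GPT-4 can solve novel and complex mathematical problems, showing strong logical reasoning and abstract thinking. It also shows excellent ability to write and understand code: it can understand and generate code in a variety of programming languages, and can even debug and optimize existing code. On vision tasks, GPT-4 demonstrates an ability to understand and interpret images; although it is not as fine-grained as dedicated computer vision models, its performance is still impressive. In medicine, GPT-4 can understand and explain medical literature and help doctors reach a preliminary diagnosis. In law, it can understand and apply legal provisions and assist with legal consultation. In psychology, GPT-4 can understand human behavior and mental states; it cannot replace professional counseling, but it can serve as an auxiliary tool that provides information.

GPT-4's limitations cannot be ignored, however. For example, it may be unable to handle information that is updated in real time, and it lacks the ability to learn and adapt to new knowledge. The model's decision process also lacks transparency and may produce unpredictable results. Ethical and privacy issues need attention as well, because powerful AI systems can be misused or misdirected.

Given the impact of the recent technological leap on society, we reflect on directions for future research. We need to ensure that AI advances not only technically but also in step with ethics, law, and social responsibility. Achieving this requires interdisciplinary cooperation among scientists, policymakers, ethicists, and the public to build a safe, fair, and beneficial future for AI.

The arrival of GPT-4 marks an important milestone in the development of artificial intelligence and points to the possibility of AGI. The road ahead is full of challenges, however, and we need to face them carefully and actively in order to achieve greater breakthroughs in future AI development.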


Sparks of Artificial General Intelligence:

Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang

Microsoft Research

Abstract

Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4 [Ope23], was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google’s PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4’s performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

arXiv:2303.12712v2 [cs.CL] 24 Mar 2023

Contents

1 Introduction
1.1 Our approach to studying GPT-4’s intelligence
1.2 Organization of our demonstration
2 Multimodal and interdisciplinary composition
2.1 Integrative ability
2.2 Vision
2.2.1 Image generation beyond memorization
2.2.2 Image generation following detailed instructions (à la Dall-E)
2.2.3 Possible application in sketch generation
2.3 Music
3 Coding
3.1 From instructions to code
3.1.1 Coding challenges
3.1.2 Real world scenarios
3.2 Understanding existing code
4 Mathematical abilities
4.1 A mathematical conversation with GPT-4
4.1.1 A first generalization of the original question
4.1.2 A second variant of the original question
4.1.3 Analysis of the limitations highlighted by conversation
4.2 Performance on mathematical problem datasets
4.3 Mathematical modeling in various domains
4.4 Higher level mathematics
5 Interaction with the world
5.1 Tool use
5.1.1 Using multiple tools to solve more complex tasks
5.1.2 Discussion
5.2 Embodied Interaction
5.2.1 Warmup: navigating a map
5.2.2 Text-based games
5.2.3 Real world problems
5.2.4 Discussion
6 Interaction with humans
6.1 Understanding Humans: Theory of Mind
6.1.1 Testing specific aspects of theory of mind
6.1.2 Testing theory of mind in realistic scenarios
6.1.3 Discussion
6.2 Talking to Humans: Explainability
7 Discriminative Capabilities
7.1 PII Detection
7.2 Misconceptions and Fact-Checking
7.2.1 Why Are Current Metrics Insufficient?
7.2.2 GPT-4 as a Judge
8 Limitations of autoregressive architecture highlighted by GPT-4
8.1 Warm-up with two basic examples
8.2 Lack of planning in arithmetic/reasoning problems
8.3 Lack of planning in text generation
9 Societal influences
9.1 Challenges of erroneous generations
9.2 Misinformation and manipulation
9.3 Bias
9.4 Human expertise, jobs, and economics
9.5 Constellation of influences and considerations
10 Directions and Conclusions
10.1 Definitions of intelligence, AI, and AGI
10.2 On the path to more general artificial intelligence
10.3 What is actually happening?
A GPT-4 has common sense grounding
B Appendix for multimodal and interdisciplinary composition
B.1 Further details on integrative ability results
B.2 Further details on vision results
B.3 Graphic novel design example

Something unknown is doing we don’t know what.

– Sir Arthur Eddington

1 Introduction

Intelligence is a multifaceted and elusive concept that has long challenged psychologists, philosophers, and computer scientists. An attempt to capture its essence was made in 1994 by a group of 52 psychologists who signed onto a broad definition published in an editorial about the science of intelligence [Got97]. The consensus group defined intelligence as a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. This definition implies that intelligence is not limited to a specific domain or task, but rather encompasses a broad range of cognitive skills and abilities. Building an artificial system that exhibits the kind of general intelligence captured by the 1994 consensus definition is a long-standing and ambitious goal of AI research. In early writings, the founders of the modern discipline of artificial intelligence (AI) research called out sets of aspirational goals for understanding intelligence [MMRS06]. Over decades, AI researchers have pursued principles of intelligence, including generalizable mechanisms for reasoning (e.g., [NSS59], [LBFL93]) and construction of knowledge bases containing large corpora of commonsense knowledge [Len95]. However, many of the more recent successes in AI research can be described as being narrowly focused on well-defined tasks and challenges, such as playing chess or Go, which were mastered by AI systems in 1996 and 2016, respectively. In the late-1990s and into the 2000s, there were increasing calls for developing more general AI systems (e.g., [SBD+96]) and scholarship in the field has sought to identify principles that might underlie more generally intelligent systems (e.g., [Leg08, GHT15]). The phrase, “artificial general intelligence” (AGI), was popularized in the early-2000s (see [Goe14]) to emphasize the aspiration of moving from the “narrow AI”, as demonstrated in the focused, real-world applications being developed, to broader notions of intelligence, harkening back to the long-term aspirations and dreams of earlier AI research. We use AGI to refer to systems that demonstrate broad capabilities of intelligence as captured in the 1994 definition above, with the additional requirement, perhaps implicit in the work of the consensus group, that these capabilities are at or above human-level. We note however that there is no single definition of AGI that is broadly accepted, and we discuss other definitions in the conclusion section.

The most remarkable breakthrough in AI research of the last few years has been the advancement of natural language processing achieved by large language models (LLMs). These neural network models are based on the Transformer architecture [VSP+17] and trained on massive corpora of web-text data, using at its core a self-supervised objective of predicting the next word in a partial sentence. In this paper, we report on evidence that a new LLM developed by OpenAI, which is an early and non-multimodal version of GPT-4 [Ope23], exhibits many traits of intelligence, according to the 1994 definition. Despite being purely a language model, this early version of GPT-4 demonstrates remarkable capabilities on a variety of domains and tasks, including abstraction, comprehension, vision, coding, mathematics, medicine, law, understanding of human motives and emotions, and more. We interacted with GPT-4 during its early development by OpenAI using purely natural language queries (prompts)¹. In Figure 1.1, we display some preliminary examples of outputs from GPT-4, asking it to write a proof of infinitude of primes in the form of a poem, to draw a unicorn in TikZ (a language for creating graphics in LaTeX), to create a complex animation in Python, and to solve a high-school level mathematical problem. It easily succeeds at all these tasks, and produces outputs that are essentially indistinguishable from (or even better than) what humans could produce. We also compare GPT-4’s performance to that of previous LLMs, most notably ChatGPT, which is a fine-tuned version of (an improved) GPT-3 [BMR+20]. In Figure 1.2, we display the results of asking ChatGPT for both the infinitude of primes poem and the TikZ unicorn drawing. While the system performs non-trivially on both tasks, there is no comparison with the outputs from GPT-4. These preliminary observations will repeat themselves throughout the paper, on a great variety of tasks. The combination of the generality of GPT-4’s capabilities, with numerous abilities spanning a broad swath of domains, and its performance on a wide spectrum of tasks at or beyond human-level, makes us comfortable with saying that GPT-4 is a significant step towards AGI.

¹ As GPT-4’s development continued after our experiments, one should expect different responses from the final version of GPT-4. In particular, all quantitative results should be viewed as estimates of the model’s potential, rather than definitive numbers. We repeat this caveat throughout the paper to clarify that the experience on the deployed model may differ. Moreover we emphasize that the version we tested was text-only for inputs, but for simplicity we refer to it as GPT-4 too.
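The self-supervised objective of predicting the next word in a partial sentence can be illustrated with a deliberately tiny sketch. This is not a Transformer and not how GPT-4 is implemented; it is a bigram count model, and the function names and the three-sentence corpus are hypothetical, chosen only to show that the training signal comes from the text itself, with no human labels.

```python
from collections import Counter, defaultdict


def train_bigram_counts(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Each adjacent pair (context word, next word) is one training example.
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts


def predict_next_word(counts, context_word):
    """Return the most frequent follower of context_word, or None if unseen."""
    followers = counts.get(context_word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "there are infinitely many primes",
    "there are infinitely many integers",
    "there are infinitely many primes to find",
]
model = train_bigram_counts(toy_corpus)
print(predict_next_word(model, "many"))
```

An LLM replaces the count table with a neural network conditioned on the whole preceding context, but the prediction target, the next word, is the same.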

GPT-4

Prompt: Can you write a proof that there are infinitely many primes, with every line that rhymes?

GPT-4:
Yes, I think I can,
Though it might take a clever plan.
I’ll start by noting Euclid’s proof,
Which shows that primes aren’t just aloof.
Assume we have a finite list,
Of primes, and that none have been missed.
Multiply them all together,
And add one, just to be clever.
The result will be a number,
That has no prime factors, oh wonder!
But every number has a prime,
That divides it, it’s just a matter of time.
So we’ve found a contradiction,
And our finite list needs eviction.
There must be infinitely many primes,
And that’s the end of my rhyming lines.
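The construction in the poem (multiply the listed primes and add one) can be checked numerically. The sketch below is an illustration under our own naming, not part of the paper: `euclid_witness` is a hypothetical helper that returns a prime missing from any finite list of primes. Strictly, the new number has no prime factors among the listed primes, which is the sense in which the poem's "no prime factors" line should be read.

```python
from math import prod


def smallest_prime_factor(n):
    """Return the smallest prime factor of an integer n >= 2 by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # no divisor up to sqrt(n), so n itself is prime


def euclid_witness(primes):
    """Given a finite list of primes, return a prime that is not in the list."""
    candidate = prod(primes) + 1
    # Every listed prime leaves remainder 1 when dividing candidate,
    # so any prime factor of candidate must be missing from the list.
    return smallest_prime_factor(candidate)


listed = [2, 3, 5, 7, 11, 13]
new_prime = euclid_witness(listed)
print(new_prime, new_prime in listed)
```

Note that the product plus one need not be prime itself: for the list above it is 30031 = 59 × 509, and the witness returned is 59.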

Prompt: Draw a unicorn in TikZ.

GPT-4: [Produces LaTeX compiling to following picture.]

Prompt: Can you generate a pyplot for the following data: x = [1, 3, 5, 6, 8], y1 = [2, 3, 5, 18, 1], y2 = [3, 5, 6, 8, 1], y3 = [5, 1, 2, 3, 4], y4 = [9, 7, 2, 3, 1]. I want y1, y2 to be in the same plot, but y3 is in another plot next to that plot, y4 is in below. I want the legend of y1 to be "bob", y2 to be "alice", y3 to be "bilbo", y4 to be "allie". I want the x-axis to be labeled with "time" and y axis to be labeled with "money". I want to add a 10%-40% random error bar to each curve, through all times (including non-integers). I want smoothed curves to show the plot, and smoothed error bar. Do not use linear interpolation, use smooth interpolation! I want to also add some small zig-zag to the smoothed curve to make it look more real. I want to put a baseline as the mean of each line. I want to put a pie chart below indicating the fraction of the four people in each time step. I also want an animation to show how the fractions are changing in the pie chart in **continuous time**. Interpolate the missing fractions! I also want vertical line animation in other three plots to match the pie chart. I want the fanciest plot. Please add as many fancy things as possible.

GPT-4: Here is a possible code to generate a pyplot for the following data, with some comments to explain the steps: [We give two snapshots from the animation resulting from the produced code]
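The code GPT-4 produced is not reproduced in this excerpt. As a rough, dependency-free illustration of two quantities the prompt asks for, the sketch below computes the baseline (mean) of each curve and the pie-chart fractions at an arbitrary time. It is our own simplification, not GPT-4's output: the helper names are hypothetical, and values between samples are interpolated piecewise-linearly only to keep the example short, whereas the prompt requires smooth interpolation for the plotted curves (a spline routine from a numerical library would be the natural substitute).

```python
from statistics import mean

# Data and legend names taken from the prompt.
x = [1, 3, 5, 6, 8]
series = {
    "bob": [2, 3, 5, 18, 1],
    "alice": [3, 5, 6, 8, 1],
    "bilbo": [5, 1, 2, 3, 4],
    "allie": [9, 7, 2, 3, 1],
}

# Baseline for each curve: the mean of its y values.
baselines = {name: mean(ys) for name, ys in series.items()}


def value_at(xs, ys, t):
    """Piecewise-linear value of the curve (xs, ys) at time t in [xs[0], xs[-1]]."""
    if not xs[0] <= t <= xs[-1]:
        raise ValueError("t is outside the sampled time range")
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= t <= x1:
            return y0 + (y1 - y0) * (t - x0) / (x1 - x0)


def pie_fractions(t):
    """Share of each person in the total at time t; the shares sum to 1."""
    values = {name: value_at(x, ys, t) for name, ys in series.items()}
    total = sum(values.values())
    return {name: v / total for name, v in values.items()}


print(baselines)
print(pie_fractions(2.0))
```

An animation would call `pie_fractions` once per frame with an increasing `t` and redraw the pie chart and the vertical time markers from the result.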

Prompt: Suppose g(x) = f^{-1}(x), g(0) = 5, g(4) = 7, g(3) = 2, g(7) = 9, g(9) = 6 what is f(f(f(6)))?

GPT-4: There are different ways to approach this problem, but one possible method is to use the fact that g(x) = f^{-1}(x) implies that f(g(x)) = x for any x in the domain of g. That is, applying f to both sides of the equation g(x) = f^{-1}(x) gives f(g(x)) = f(f^{-1}(x)) = x. This means that we can undo the effect of f by applying g, and vice versa. Using this idea, we can try to find f(6) by looking for the value of x such that g(x) = 6. From the given information, we see that g(9) = 6, so f(6) = f(g(9)) = 9. Similarly, we can find f(9) by looking for the value of x such that g(x) = 9. We see that g(7) = 9, so f(9) = f(g(7)) = 7. Finally, we can find f(7) by looking for the value of x such that g(x) = 7. We see that g(4) = 7, so f(7) = f(g(4)) = 4. Therefore, f(f(f(6))) = f(f(9)) = f(7) = 4.

Figure 1.1: Preliminary examples of GPT-4’s capabilities in language, vision, coding, and mathematics.
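The inverse-function answer in the last panel can be verified mechanically. The sketch below stores the given values of g as a dictionary and swaps keys and values to obtain f on the points where it is determined; the helper name `apply_repeatedly` is ours, introduced only for this check.

```python
# Values of g = f^{-1} given in the prompt.
g = {0: 5, 4: 7, 3: 2, 7: 9, 9: 6}

# g(x) = y means f(y) = x, so f is g with keys and values swapped.
f = {y: x for x, y in g.items()}


def apply_repeatedly(mapping, start, times):
    """Apply a dict-based function to start the given number of times."""
    value = start
    for _ in range(times):
        value = mapping[value]  # KeyError if the value is outside the known domain
    return value


print(apply_repeatedly(f, 6, 3))  # prints 4
```

The chain is f(6) = 9, f(9) = 7, f(7) = 4, matching the answer GPT-4 gives.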
