Mathematical Capabilities of ChatGPT
Simon Frieder (1,4,*), Luca Pinchetti (1), Ryan-Rhys Griffiths (3), Tommaso Salvatori (2), Thomas Lukasiewicz (1,2), Philipp Christian Petersen (4,5), Alexis Chevalier (6), and Julius Berner (4)

(1) Department of Computer Science, University of Oxford, Oxford, UK
(2) Institute of Logic and Computation, TU Wien, Vienna, Austria
(3) Department of Physics, University of Cambridge, Cambridge, UK
(4) Faculty of Mathematics, University of Vienna, Vienna, Austria
(5) Research Network Data Science, University of Vienna, Vienna, Austria
(6) School of Mathematics, Institute for Advanced Study, Princeton, US

(*) Corresponding author: simon.frieder@wolfson.ox.ac.uk. The subsequent author list is ordered randomly.

February 1, 2023 (arXiv:2301.13867v1 [cs.LG], 31 Jan 2023)
Abstract
We investigate the mathematical capabilities of ChatGPT by testing it on publicly available datasets, as
well as hand-crafted ones, and measuring its performance against other models trained on a mathematical
corpus, such as Minerva. We also test whether ChatGPT can be a useful assistant to professional
mathematicians by emulating various use cases that come up in the daily professional activities of
mathematicians (question answering, theorem searching). In contrast to formal mathematics, where
large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of
natural-language mathematics, used to benchmark language models, only cover elementary mathematics.
We address this issue by introducing a new dataset: GHOSTS. It is the first natural-language dataset made
and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics
and (2) provides a holistic overview of the mathematical capabilities of language models. We benchmark
ChatGPT on GHOSTS and evaluate performance against fine-grained criteria. We make this new
dataset publicly available (github.com/friederrr/science-GHOSTS) to assist a community-driven comparison of ChatGPT with (future) large
language models in terms of advanced mathematical comprehension. We conclude that contrary to many
positive reports in the media (a potential case of selection bias), ChatGPT’s mathematical abilities are
significantly below those of an average mathematics graduate student. Our results show that ChatGPT
often understands the question but fails to provide correct solutions. Hence, if your goal is to use it to
pass a university exam, you would be better off copying from your average peer!
1 Introduction
Since its introduction, ChatGPT has rapidly become a widely known question-and-answer dialogue system.
It has been mentioned in traditional media across the globe [33, 28, 22] and across all major internet platforms [40, 43]. According to Twitter data, it is by far the most talked-about language model to date; cf. Figure 1.
The performance of ChatGPT has been analyzed in a large number of exam-related use cases, with varying
degrees of scientific rigor, ranging from detailed studies to anecdotal evidence. Use cases include passing the
United States Medical Licensing Examination [17], scoring highly on the Psychology Today Verbal-Linguistic Intelligence IQ Test [34], and answering (and generating) Operations Management exam questions that were deemed to be within the scope of a typical MBA curriculum [41], all with a performance that elicited positive surprise from the authors. For these and other reasons, it is widely believed that large language models (LLMs) will impact a large number of areas and will be used as assistants by many professionals.

Figure 1: Twitter count data relating the counts of a selection of notable large language models, starting from the release date of GPT-3. The x-axis is log-scaled. ChatGPT counts dominate those of all other language models by far. Vertical year-ticks denote the end of the mentioned year.
In this article, we will focus on performing a detailed analysis of the mathematical capabilities of ChatGPT.
This includes, but is not limited to, answering exam-style mathematical questions and investigating how
ChatGPT behaves in a number of mathematical contexts. Our analysis includes testing how many of the
skills ChatGPT can emulate that are necessary to do professional mathematics. Examples of such skills are
the ability to answer computational questions (“What is the value of $\int_0^{\pi/2} \arccos\left(\frac{\cos x}{1 + 2\cos x}\right)\,dx$?”), the ability
to complete mathematical proofs that have gaps or missing steps, the ability to solve questions that are
more focused on deep insights and original solutions, such as those of mathematical Olympiads, and the
ability to survey the literature and think across domains (“which other theorems do we need to prove a given
theorem?”).
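As an illustration (ours, not part of the paper's methodology), the computational question above can be checked numerically in a few lines of Python; the sketch below assumes NumPy and SciPy are available and uses SciPy's quad routine, which returns the value together with an error estimate:

# Numerically evaluate the integral from the example prompt.
# A minimal sketch; assumes NumPy and SciPy are installed.
import numpy as np
from scipy.integrate import quad

integrand = lambda x: np.arccos(np.cos(x) / (1 + 2 * np.cos(x)))
value, abs_err = quad(integrand, 0, np.pi / 2)
print(value, abs_err)  # numerical value and quadrature error estimate

A reference value obtained this way can then be compared against whatever closed-form expression a model produces.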
To do this, we have designed a thorough testing methodology to evaluate the outputs of ChatGPT, including
error codes that represent various possible failure modes of ChatGPT (see Section 3). We score ChatGPT’s
responses, we report on the results using this methodology, and we compare ChatGPT to state-of-the-art
models trained for mathematical comprehension.
Moreover, we have created new datasets of prompts that are aimed at testing specific aspects of ChatGPT
related to mathematical comprehension. We evaluate ChatGPT by comparing it to random samples
from existing datasets that were devised to test models that were specifically trained for mathematical
comprehension [18, 16]. A number of the datasets are specifically designed so that the questions could not be answered if ChatGPT were merely memorizing the results. All of those datasets were created by the authors.
In summary, the contributions of this article are threefold:
• First, insight for mathematical use is provided. We show for which types of questions and which domains of mathematics ChatGPT may be useful, and how it could be integrated into the workflow of a mathematician.
• Second, the failure modes of ChatGPT are identified, as well as the limits of its capabilities. This can aid future efforts to develop LLMs that perform better in mathematics. Our analysis is akin to a mathematical model card, where the mathematical strengths and weaknesses are summarized (see Section 4).
• Third, we provide benchmarks for testing the mathematical capabilities of future LLMs so that they can be compared to ChatGPT across a range of aspects of advanced mathematical comprehension. This is achieved by introducing new natural-language math datasets. Two of these benchmarks are derived from the most advanced datasets of mathematical queries for language models that exist today. Additionally, we devise four more datasets on which we benchmark ChatGPT's performance. We release the collection of these datasets publicly on GitHub (github.com/friederrr/science-GHOSTS), and we encourage community participation by allowing GitHub pull requests in order to grow the datasets beyond their current sizes.
2 Related Work
As a large language model, ChatGPT can be universally employed to perform mathematical reasoning and therefore has to be compared with technologies in this space that are sometimes decades old. Performing mathematical reasoning in an automated way has a long history that can be traced back to 1959 [37], with most of the focus devoted to proving theorems [11]. Presently, there is a realization that the classical approaches, which use a symbolic encoding of mathematics, have reached a plateau [14].
There is now a growing body of literature on learning mathematical relationships directly in a supervised-
learning manner [2, 10, 15] or by using LLMs to perform mathematical reasoning directly on mathematics encoded in natural language [20]. Sometimes, the distinction is blurred because Transformers can also
be used in a supervised-learning setting and have been employed successfully in learning mathematical
relationships [18, 6].
Most recently published large language models, such as PaLM [8], released in 2022, are tested only on elementary-level mathematical reasoning datasets, such as the GSM8K dataset [9]. We speculate that this is because the obtained results already suggest that the models struggle on datasets much simpler than ours, such as the MathQA dataset [1] or the GSM8K dataset [9]. For example, the version of PaLM with 540 billion parameters, with chain-of-thought prompting and access to an external calculator, solves only 58% of the problems on the GSM8K dataset [8, Table 10]. This model nonetheless outperforms GPT-3 [5] on the same dataset, which solves at best 54%; this performance is consistent with that of older models.
Variations of BERT [30] have been shown to solve only between 28% and 37% of the problems when fine-tuned and tested on the AQuA-RAT dataset [21], which is the direct predecessor of MathQA. In some cases, such as the LaMDA model [42] or BLOOM [19], both also released in 2022, an evaluation of mathematical reasoning capability is missing entirely.
Among the mentioned LLMs, Minerva [20], based on PaLM, stands out: it was trained in equal parts on websites that contain MathJax elements and on arXiv preprints (in addition to the general natural-language data on which PaLM was trained), achieving a score of roughly 50% on a significantly harder dataset, the MATH (Mathematics Aptitude Test of Heuristics) dataset [16], which was sourced from various mathematical competitions.
One distinguishing feature of the MATH dataset [16] is that its problems admit (1) a unique answer (no open-ended questions) and (2) an answer that can be condensed into a few characters (a number, for example). This is beneficial for the automatic evaluation of a model on such a dataset, since one can simply ask for the final answer, ignoring the step-by-step solution (moreover, one can train models, as [16] do, to fit this style of inquiry and output either the final solution only or the step-by-step derivation leading to the solution).
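To make this mechanism concrete, final-answer evaluation on such a dataset can be as simple as string comparison after light normalization. The sketch below is our own illustration of the general idea; the normalization rules are assumptions, not the exact procedure of [16]:

# Final-answer exact matching, as enabled by datasets whose answers
# condense into a few characters. Normalization rules are illustrative.
def normalize(answer: str) -> str:
    return answer.strip().rstrip(".").replace(" ", "")

def exact_match(model_answer: str, reference: str) -> bool:
    return normalize(model_answer) == normalize(reference)

assert exact_match(" 42. ", "42")
assert not exact_match("41", "42")

No such shortcut exists for open-ended answers such as full proofs, which is why expert raters are needed for most of our dataset.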
Among the supervised approaches, we mention [18], where a Transformer architecture was used to generate symbolic solutions for the integration of functions and for finding closed-form solutions to first-order and second-order differential equations; it outperformed classical solvers, such as Mathematica, MATLAB, and Maple, by at least 14% on a test set of integration problems. On the task of solving differential equations, the Transformer-based approach still exceeds the classical approach, but by a smaller margin (at least 4% in the case of first-order differential equations, with more varied results for second-order equations). An up-to-date survey on mathematical datasets and the performance of various LLMs can be found in [23].
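One reason symbolic integration is a natural benchmark is that a candidate antiderivative can be verified mechanically by differentiating it back. The following sketch (our own illustration, using SymPy rather than the systems compared in [18]) shows such a check:

# Verify a closed-form antiderivative by differentiating it back.
# A minimal sketch assuming SymPy is installed.
import sympy as sp

x = sp.symbols("x")
f = x * sp.cos(x)
F = sp.integrate(f, x)                      # -> x*sin(x) + cos(x)
assert sp.simplify(sp.diff(F, x) - f) == 0  # check that F' == f
print(F)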
For ChatGPT, most investigations related to mathematical reasoning consist, to date, of anecdotal evidence concerning its performance and its failure modes; see, e.g., [43, 32, 24, 40]. Unfortunately, a clear methodology is missing, as most of the results are scattered across various internet platforms and are not easily reproducible. To the best of our knowledge, the only mathematical investigation was undertaken in [4], which mainly investigated ChatGPT's capability to compute irrational numbers to high accuracy.
On the other hand, we would like to mention the case of formalized mathematics, where large databases that encode advanced mathematical concepts exist, e.g., the Lean Mathematical Library [25]. Some of the ideas that we have used in this article, such as prompting with missing proofs, are echoed in [31] for formal mathematics. Yet, for the purpose of doing mathematics with large language models, these formal datasets cannot be leveraged, since no straightforward way exists to convert them to natural language (in addition to various issues, such as bias, that might occur in the context of an automatic conversion).
3 Datasets
3.1 Dataset creation
We assess the mathematical reasoning capabilities of ChatGPT by creating a collection of multiple datasets of prompts, totaling 728 prompts, for which ChatGPT's output was manually rated by experts. Then, we record and rate each of the outputs provided by the model. The combined effort of devising mathematically insightful prompts, some of which are at graduate-level mathematics, and carefully rating the output of ChatGPT amounts to several hundred person-hours.
We divide our entire collection of prompts into six subdatasets (in the GitHub repository, each subdataset corresponds to a folder, which in turn can consist of multiple files), called

• Grad-Text
• Holes-in-Proofs
• Olympiad-Problem-Solving
• Symbolic-Integration
• MATH
• Search-Engine-Aspects
We summarize those in Table 1. The first letters of the subdataset names, with MATH contributing its "T", make up the GHOSTS acronym.
Two of the subdatasets, the MATH subdataset and the Symbolic-Integration subdataset, use prompts taken
from existing datasets, [16] and [18], respectively. This was done in order to be able to compare how ChatGPT performs against existing state-of-the-art models, one based on an LLM, Minerva [20], and one based on a supervised-learning approach [18]. Nonetheless, significant additional annotation effort was involved, since in both cases the authors, as experts in the field, rated the output. Furthermore, in the second case, a conversion from Polish notation was necessary (see the sketch below).
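For readers unfamiliar with the notation: [18] encodes expressions as prefix (Polish) token sequences, which must be converted to ordinary infix notation before they can serve as natural-language prompts. The following recursive converter is an illustrative sketch of that transformation (our own code, not the authors'; the operator vocabulary is an assumption):

# Convert a prefix (Polish) token list to an infix string.
# The token list is consumed in place; operator names are illustrative.
def prefix_to_infix(tokens):
    binary = {"add": "+", "sub": "-", "mul": "*", "div": "/", "pow": "^"}
    unary = {"sin", "cos", "tan", "exp", "ln", "sqrt"}
    tok = tokens.pop(0)
    if tok in binary:
        left = prefix_to_infix(tokens)
        right = prefix_to_infix(tokens)
        return f"({left} {binary[tok]} {right})"
    if tok in unary:
        return f"{tok}({prefix_to_infix(tokens)})"
    return tok  # a variable or numeric constant

print(prefix_to_infix(["add", "x", "mul", "2", "cos", "x"]))
# -> (x + (2 * cos(x)))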
The other subdatasets were hand-crafted by the authors. We note that it is neither possible to outsource the
creation of these datasets to a crowdsourcing service, such as Amazon Mechanical Turk, nor is it possible to
generate these datasets automatically from code because advanced mathematical insight is required for the
creation of each prompt (though based on our work, it might be possible to extend the dataset by creating
variations of our questions in a purely programmatic manner; see Section 5).
Dataset name               Comprised of the file(s)                 Tags
Grad-Text                  W. Rudin, Functional Analysis (ch. 1)    M3 Q4
                           W. Rudin, Functional Analysis (ch. 2)    M3 Q4
                           J. Munkres, Topology (ch. 1)             M3 Q4
                           J. Munkres, Topology (ch. 2)             M3 Q4
                           R. Durrett, Probability Theory           M3 Q4
Holes-in-Proofs            Proofs Collection A                      M3 Q2 Q5
                           Proofs Collection B Prealgebra           M1 Q5
                           Proofs Collection B Precalculus          M1 Q5
Olympiad-Problem-Solving   Olympiad Problem Solving                 M4 Q4 D2
Symbolic-Integration       Symbolic Integration                     M2 Q3 D1
MATH                       MATH Algebra                             M1 M2 M3 Q3 Q4
                           MATH Counting and Probability            M1 M2 M3 Q3 Q4
                           MATH Prealgebra                          M1 Q3 Q4
                           MATH Precalculus                         M1 Q3 Q4
Search-Engine-Aspects      Definition Retrieval                     M3 Q1 Q2 D3
                           Reverse Definition Retrieval             M3 Q2 D3
                           Named Theorem Proof Completion           M3 Q1 Q2 D3

Table 1: A summary of all datasets, together with their associated tags. The tags Mi, Qi, and Di relate to the level of Mathematical difficulty, the Question type, and the Out-of-Distribution type from Section 3.3, respectively.
Furthermore, unlike in the case of the MATH dataset by [16] (see Section 2), the output of ChatGPT cannot be automatically evaluated, and a professional opinion on its output needs to be given. This raises the difficulty of creating more data, since graduate-level mathematics (and in some cases PhD-level mathematics) is required. The mathematical skill level of the authors matches this requirement.
Our dataset goes beyond all the mathematical datasets for LLMs mentioned in Section 2 in terms of mathematical sophistication and in terms of the different aspects of mathematical reasoning that are being tested. It also surpasses the datasets mentioned in the survey [23] along these dimensions. Furthermore, a large number of our questions, unlike those of the MATH dataset by [16], do not have an answer that can be condensed into a few tokens, such as a number or a function (e.g., when the answer is a mathematical proof), and therefore evaluation cannot be performed automatically.
3.2 Format
The format of each of the subdatasets that make up our GHOSTS dataset follows the same convention: each subdataset consists of (potentially multiple) JSON-formatted files with entries such as in the example below. Our format is similar to, e.g., that of the AQuA-RAT (Algebra Question Answering with Rationales) dataset [21]. A single datapoint in a JSON file has the following form:
{
  "prompt": "Can you quote a famous mathematical theorem?",
  "output": "Euclid's Elements, Proposition 47: In right-angled triangles, the square on the side opposite the right angle is equal to the sum of the squares on the other two sides.",
  "rating": "5",
  "errorcodes": [""],
  "warningcodes": ["w1"],
  "comment": "This is actually Pythagoras' theorem, which was not mentioned, hence only a warning flag is raised.",
  "ref": ""
}
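Given this format, a file of datapoints can be processed programmatically. The snippet below is a hedged sketch (the file name is hypothetical, and we assume each JSON file holds a list of such datapoints) showing how ratings and warning codes might be aggregated:

# Aggregate ratings and warning codes from a GHOSTS-style JSON file.
# File name is hypothetical; structure follows the datapoint above.
import json
from collections import Counter

with open("grad_text_rudin_ch1.json") as f:
    datapoints = json.load(f)  # assumed: a list of datapoint objects

ratings = [int(d["rating"]) for d in datapoints]
print("mean rating:", sum(ratings) / len(ratings))

warning_counts = Counter(w for d in datapoints
                           for w in d["warningcodes"] if w)
print(warning_counts)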