Mathematical Capabilities of ChatGPT
Simon Frieder (1,4,*), Luca Pinchetti (1), Ryan-Rhys Griffiths (3), Tommaso Salvatori (2), Thomas Lukasiewicz (1,2), Philipp Christian Petersen (4,5), Alexis Chevalier (6), and Julius Berner (4)

(1) Department of Computer Science, University of Oxford, Oxford, UK
(2) Institute of Logic and Computation, TU Wien, Vienna, Austria
(3) Department of Physics, University of Cambridge, Cambridge, UK
(4) Faculty of Mathematics, University of Vienna, Vienna, Austria
(5) Research Network Data Science, University of Vienna, Vienna, Austria
(6) School of Mathematics, Institute for Advanced Study, Princeton, US

(*) Corresponding author: simon.frieder@wolfson.ox.ac.uk. The subsequent author list is ordered randomly.

February 1, 2023 (arXiv:2301.13867v1 [cs.LG], 31 Jan 2023)
Abstract
We investigate the mathematical capabilities of ChatGPT by testing it on publicly available datasets, as
well as hand-crafted ones, and measuring its performance against other models trained on a mathematical
corpus, such as Minerva. We also test whether ChatGPT can be a useful assistant to professional
mathematicians by emulating various use cases that come up in the daily professional activities of
mathematicians (question answering, theorem searching). In contrast to formal mathematics, where
large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of
natural-language mathematics, used to benchmark language models, only cover elementary mathematics.
We address this issue by introducing a new dataset: GHOSTS. It is the first natural-language dataset made
and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics
and (2) provides a holistic overview of the mathematical capabilities of language models. We benchmark
ChatGPT on GHOSTS and evaluate performance against fine-grained criteria. We make this new
dataset publicly available (github.com/friederrr/science-GHOSTS) to assist a community-driven comparison of ChatGPT with (future) large
language models in terms of advanced mathematical comprehension. We conclude that contrary to many
positive reports in the media (a potential case of selection bias), ChatGPT’s mathematical abilities are
significantly below those of an average mathematics graduate student. Our results show that ChatGPT
often understands the question but fails to provide correct solutions. Hence, if your goal is to use it to
pass a university exam, you would be better off copying from your average peer!
1 Introduction
Since its introduction, ChatGPT has rapidly become a widely known question-and-answer dialogue system.
It has been mentioned in traditional media across the globe [33, 28, 22] and across all major internet platforms [40, 43]. According to Twitter data, it is by far the most talked-about language model to date; cf. Figure 1.
The performance of ChatGPT has been analyzed in a large number of exam-related use cases, with varying
degrees of scientific rigor, ranging from detailed studies to anecdotal evidence. Use cases include passing the
United States Medical Licensing Examination [17], scoring highly on the Psychology Today Verbal-Linguistic Intelligence IQ Test [34], and answering (and generating) Operations Management exam questions that were deemed to be within the scope of a typical MBA curriculum [41], all with a performance that elicited positive surprise from the authors. For these and other reasons, it is widely believed that large language models (LLMs) will impact a large number of areas and will be used as assistants by many professionals.

Figure 1: Twitter count data relating the counts of a selection of notable large language models, starting from the release date of GPT-3. The x-axis is log-scaled. ChatGPT counts dominate those of all other language models by far. Vertical year-ticks denote the end of the mentioned year.
In this article, we will focus on performing a detailed analysis of the mathematical capabilities of ChatGPT.
This includes, but is not limited to, answering exam-style mathematical questions and investigating how
ChatGPT behaves in a number of mathematical contexts. Our analysis includes testing how many of the
skills ChatGPT can emulate that are necessary to do professional mathematics. Examples of such skills are
the ability to answer computational questions (“What is the value of $\int_0^{\pi/2} \arccos\left(\frac{\cos x}{1 + 2\cos x}\right)\,dx$?”), the ability
to complete mathematical proofs that have gaps or missing steps, the ability to solve questions that are
more focused on deep insights and original solutions, such as those of mathematical Olympiads, and the
ability to survey the literature and think across domains (“which other theorems do we need to prove a given
theorem?”).
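As an illustration (ours, not part of the paper's methodology), the computational question above can be checked numerically in a few lines of Python; the sketch below assumes NumPy and SciPy are available and uses SciPy's quad routine, which returns the value together with an error estimate:

# Numerically evaluate the integral from the example prompt.
# A minimal sketch; assumes NumPy and SciPy are installed.
import numpy as np
from scipy.integrate import quad

integrand = lambda x: np.arccos(np.cos(x) / (1 + 2 * np.cos(x)))
value, abs_err = quad(integrand, 0, np.pi / 2)
print(value, abs_err)  # numerical value and quadrature error estimate

A reference value obtained this way can then be compared against whatever closed-form expression a model produces.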
To do this, we have designed a thorough testing methodology to evaluate the outputs of ChatGPT, including
error codes that represent various possible failure modes of ChatGPT (see Section 3). We score ChatGPT’s
responses, we report on the results using this methodology, and we compare ChatGPT to state-of-the-art
models trained for mathematical comprehension.
Moreover, we have created new datasets of prompts that are aimed at testing specific aspects of ChatGPT
related to mathematical comprehension. We evaluate ChatGPT by comparing it to random samples
from existing datasets that were devised to test models that were specifically trained for mathematical
comprehension [18, 16]. A number of the datasets are specifically designed so that the questions could not be answered if ChatGPT were merely memorizing the results. All of those datasets were created by the authors.
In summary, the contributions of this article are threefold:
• First, insight for mathematical use is provided. We show for which types of questions and which domains of mathematics ChatGPT may be useful, and how it could be integrated into the workflow of a mathematician.
• Second, the failure modes of ChatGPT are identified, as well as the limits of its capabilities. This can aid future efforts to develop LLMs that perform better in mathematics. Our analysis is akin to a mathematical model card, where the mathematical strengths and weaknesses are summarized (see Section 4).
• Third, we provide benchmarks for testing the mathematical capabilities of future LLMs so that they can be compared to ChatGPT across a range of aspects of advanced mathematical comprehension. This is achieved by introducing new natural-language math datasets. Two of these benchmarks are derived from the most advanced datasets of mathematical queries for language models that exist today. Additionally, we devise four more datasets on which we benchmark ChatGPT's performance. We release the collection of these datasets publicly on GitHub (github.com/friederrr/science-GHOSTS), and we encourage community participation by allowing GitHub pull requests in order to grow the datasets beyond their current sizes.
2 Related Work
As a large language model, ChatGPT can be universally employed to perform mathematical reasoning and therefore has to be compared with technologies in this space that are sometimes decades old. Performing mathematical reasoning in an automated way has a long history that can be traced back to 1959 [37], with most of the focus devoted to proving theorems [11]. Presently, there is a realization that the classical approaches, which use a symbolic encoding of mathematics, have reached a plateau [14].
There is now a growing body of literature on learning mathematical relationships directly in a supervised-
learning manner [2, 10, 15] or by using LLMs to perform mathematical reasoning directly on mathematics encoded in natural language [20]. Sometimes, the distinction is blurred because Transformers can also
be used in a supervised-learning setting and have been employed successfully in learning mathematical
relationships [18, 6].
Most recently published large language models, such as PaLM [8], released in 2022, are tested only on elementary-level mathematical reasoning datasets, such as the GSM8K dataset [9]. We speculate that this is because the obtained results already suggest that the models struggle on datasets much simpler than ours, such as the MathQA dataset [1] or the GSM8K dataset [9]. For example, the version of PaLM with 540 billion parameters, with chain-of-thought prompting and access to an external calculator, solves only 58% of the problems on the GSM8K dataset [8, Table 10]. This model nonetheless outperforms GPT-3 [5] on the same dataset, which solves at best 54%; this performance is consistent with that of older models.
Variations of BERT [30] have been shown to solve only between 28% and 37% of the problems when fine-tuned and tested on the AQuA-RAT dataset [21], which is the direct predecessor of MathQA. In some cases, such as the LaMDA model [42] or BLOOM [19], both also released in 2022, an evaluation of mathematical reasoning capability is missing entirely.
Among the mentioned LLMs, Minerva [20], based on PaLM, stands out: it was trained in equal parts on websites that contain MathJax elements and on arXiv preprints (in addition to the general natural-language data on which PaLM was trained), achieving a score of roughly 50% on a significantly harder dataset, the MATH (Mathematics Aptitude Test of Heuristics) dataset [16], which was sourced from various mathematical competitions.
One distinguishing feature of the MATH dataset [16] is that its problems admit (1) a unique answer (no open-ended questions) and (2) an answer that can be condensed into a few characters (a number, for example). This is beneficial for the automatic evaluation of a model on such a dataset, since one can simply ask for the final answer, ignoring the step-by-step solution (moreover, one can train models, as [16] do, to fit this style of inquiry and output either the final solution only or the step-by-step derivation leading to the solution).
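To make this mechanism concrete, final-answer evaluation on such a dataset can be as simple as string comparison after light normalization. The sketch below is our own illustration of the general idea; the normalization rules are assumptions, not the exact procedure of [16]:

# Final-answer exact matching, as enabled by datasets whose answers
# condense into a few characters. Normalization rules are illustrative.
def normalize(answer: str) -> str:
    return answer.strip().rstrip(".").replace(" ", "")

def exact_match(model_answer: str, reference: str) -> bool:
    return normalize(model_answer) == normalize(reference)

assert exact_match(" 42. ", "42")
assert not exact_match("41", "42")

No such shortcut exists for open-ended answers such as full proofs, which is why expert raters are needed for most of our dataset.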
Among the supervised approaches, we mention [18], where a Transformer architecture was used to generate symbolic solutions for the integration of functions and for finding closed-form solutions to first-order and second-order differential equations; it outperformed classical solvers, such as Mathematica, MATLAB, and Maple, by at least 14% on a test set of integration problems. On the task of solving differential equations, the Transformer-based approach still exceeds the classical approach, but by a smaller margin (at least 4% in the case of first-order differential equations, with more varied results for second-order equations). An up-to-date survey on mathematical datasets and the performance of various LLMs can be found in [23].
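One reason symbolic integration is a natural benchmark is that a candidate antiderivative can be verified mechanically by differentiating it back. The following sketch (our own illustration, using SymPy rather than the systems compared in [18]) shows such a check:

# Verify a closed-form antiderivative by differentiating it back.
# A minimal sketch assuming SymPy is installed.
import sympy as sp

x = sp.symbols("x")
f = x * sp.cos(x)
F = sp.integrate(f, x)                      # -> x*sin(x) + cos(x)
assert sp.simplify(sp.diff(F, x) - f) == 0  # check that F' == f
print(F)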
For ChatGPT, most investigations related to mathematical reasoning consist, to date, of anecdotal evidence concerning its performance and its failure modes; see, e.g., [43, 32, 24, 40]. Unfortunately, a clear methodology is missing, as most of the results are scattered across various internet platforms and are not easily reproducible. To the best of our knowledge, the only mathematical investigation was undertaken in [4], which mainly investigated ChatGPT's capability to compute irrational numbers to high accuracy.
On the other hand, we would like to mention the case of formalized mathematics, where large databases that encode advanced mathematical concepts exist, e.g., the Lean Mathematical Library [25]. Some of the ideas that we have used in this article, such as prompting with missing proofs, are echoed in [31] for formal mathematics. Yet, for the purpose of doing mathematics with large language models, these formal datasets cannot be leveraged, since no straightforward way exists to convert them to natural language (in addition to various issues, such as bias, that might occur in the context of an automatic conversion).
3 Datasets
3.1 Dataset creation
We assess the mathematical reasoning capabilities of ChatGPT by creating a collection of multiple datasets of prompts, totaling 728 prompts, for which ChatGPT's output was manually rated by experts. Then, we record and rate each of the outputs provided by the model. The combined effort of devising mathematically insightful prompts, some of which are at graduate-level mathematics, and carefully rating the output of ChatGPT amounts to several hundred person-hours.
We divide our entire collection of prompts into six subdatasets (in the GitHub repository, each subdataset corresponds to a folder, which in turn can consist of multiple files), called

• Grad-Text
• Holes-in-Proofs
• Olympiad-Problem-Solving
• Symbolic-Integration
• MATH
• Search-Engine-Aspects
We summarize those in Table 1. The first letters of the subdataset names, with MATH contributing its "T", make up the GHOSTS acronym.
Two of the subdatasets, the MATH subdataset and the Symbolic-Integration subdataset, use prompts taken
from existing datasets, [16] and [18], respectively. This was done in order to be able to compare how ChatGPT performs against existing state-of-the-art models, one based on an LLM, Minerva [20], and one based on a supervised-learning approach [18]. Nonetheless, significant additional annotation effort was involved, since in both cases the authors, as experts in the field, rated the output. Furthermore, in the second case, a conversion from Polish notation was necessary (see the sketch below).
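For readers unfamiliar with the notation: [18] encodes expressions as prefix (Polish) token sequences, which must be converted to ordinary infix notation before they can serve as natural-language prompts. The following recursive converter is an illustrative sketch of that transformation (our own code, not the authors'; the operator vocabulary is an assumption):

# Convert a prefix (Polish) token list to an infix string.
# The token list is consumed in place; operator names are illustrative.
def prefix_to_infix(tokens):
    binary = {"add": "+", "sub": "-", "mul": "*", "div": "/", "pow": "^"}
    unary = {"sin", "cos", "tan", "exp", "ln", "sqrt"}
    tok = tokens.pop(0)
    if tok in binary:
        left = prefix_to_infix(tokens)
        right = prefix_to_infix(tokens)
        return f"({left} {binary[tok]} {right})"
    if tok in unary:
        return f"{tok}({prefix_to_infix(tokens)})"
    return tok  # a variable or numeric constant

print(prefix_to_infix(["add", "x", "mul", "2", "cos", "x"]))
# -> (x + (2 * cos(x)))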
The other subdatasets were hand-crafted by the authors. We note that it is neither possible to outsource the
creation of these datasets to a crowdsourcing service, such as Amazon Mechanical Turk, nor is it possible to
generate these datasets automatically from code because advanced mathematical insight is required for the
creation of each prompt (though based on our work, it might be possible to extend the dataset by creating
variations of our questions in a purely programmatic manner; see Section 5).
Dataset name               Comprised of the file(s)                 Tags
Grad-Text                  W. Rudin, Functional Analysis (ch. 1)    M3 Q4
                           W. Rudin, Functional Analysis (ch. 2)    M3 Q4
                           J. Munkres, Topology (ch. 1)             M3 Q4
                           J. Munkres, Topology (ch. 2)             M3 Q4
                           R. Durrett, Probability Theory           M3 Q4
Holes-in-Proofs            Proofs Collection A                      M3 Q2 Q5
                           Proofs Collection B Prealgebra           M1 Q5
                           Proofs Collection B Precalculus          M1 Q5
Olympiad-Problem-Solving   Olympiad Problem Solving                 M4 Q4 D2
Symbolic-Integration       Symbolic Integration                     M2 Q3 D1
MATH                       MATH Algebra                             M1 M2 M3 Q3 Q4
                           MATH Counting and Probability            M1 M2 M3 Q3 Q4
                           MATH Prealgebra                          M1 Q3 Q4
                           MATH Precalculus                         M1 Q3 Q4
Search-Engine-Aspects      Definition Retrieval                     M3 Q1 Q2 D3
                           Reverse Definition Retrieval             M3 Q2 D3
                           Named Theorem Proof Completion           M3 Q1 Q2 D3

Table 1: A summary of all datasets, together with their associated tags. The tags Mi, Qi, and Di relate to the level of Mathematical difficulty, the Question type, and the Out-of-Distribution type from Section 3.3, respectively.
Furthermore, unlike in the case of the MATH dataset by [16] (see Section 2), the output of ChatGPT cannot be automatically evaluated, and a professional opinion on its output needs to be given. This raises the difficulty of creating more data, since graduate-level mathematics (and in some cases PhD-level mathematics) is required. The mathematical skill level of the authors matches this requirement.
Our dataset goes beyond all the mathematical datasets for LLMs mentioned in Section 2 in terms of mathematical sophistication and in terms of the different aspects of mathematical reasoning that are being tested. It also surpasses the datasets mentioned in the survey [23] along these dimensions. Furthermore, a large number of our questions, unlike those of the MATH dataset by [16], do not have an answer that can be condensed into a few tokens, such as a number or a function (e.g., when the answer is a mathematical proof), and therefore evaluation cannot be performed automatically.
3.2 Format
The format of each of the subdatasets that make up our GHOSTS dataset follows the same convention: each subdataset consists of (potentially multiple) JSON-formatted files with entries such as in the example below. Our format is similar to, e.g., that of the AQuA-RAT (Algebra Question Answering with Rationales) dataset [21]. A single datapoint in a JSON file has the following form:
{
  "prompt": "Can you quote a famous mathematical theorem?",
  "output": "Euclid's Elements, Proposition 47: In right-angled triangles, the square on the side opposite the right angle is equal to the sum of the squares on the other two sides.",
  "rating": "5",
  "errorcodes": [""],
  "warningcodes": ["w1"],
  "comment": "This is actually Pythagoras' theorem, which was not mentioned, hence only a warning flag is raised.",
  "ref": ""
}
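Given this format, a file of datapoints can be processed programmatically. The snippet below is a hedged sketch (the file name is hypothetical, and we assume each JSON file holds a list of such datapoints) showing how ratings and warning codes might be aggregated:

# Aggregate ratings and warning codes from a GHOSTS-style JSON file.
# File name is hypothetical; structure follows the datapoint above.
import json
from collections import Counter

with open("grad_text_rudin_ch1.json") as f:
    datapoints = json.load(f)  # assumed: a list of datapoint objects

ratings = [int(d["rating"]) for d in datapoints]
print("mean rating:", sum(ratings) / len(ratings))

warning_counts = Counter(w for d in datapoints
                           for w in d["warningcodes"] if w)
print(warning_counts)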