1 INTRODUCTION
Recent work has shown that presenting explicit reasoning steps in English (i.e., chain-of-thought prompting; CoT) elicits multi-step reasoning abilities of large language models such as GPT-3 and PaLM (Brown et al., 2020; Chowdhery et al., 2022; Wei et al., 2022b, inter alia). Pretrained multilingual language models have also achieved impressive performance on various NLP tasks across typologically distinct languages (Conneau et al., 2020; Xue et al., 2021; Chowdhery et al., 2022; Clark et al., 2020; Hu et al., 2020; Ruder et al., 2021, inter alia). However, tasks in existing multilingual benchmarks usually require only simple reasoning steps, so it remains unclear how well language models perform on tasks that require more complex reasoning in a multilingual setting.
In this work, we introduce the MGSM benchmark to bridge the gap between progress on English-based chain-of-thought reasoning and multilingual NLP. We extend a subset of the English-language GSM8K dataset (Cobbe et al., 2021) to ten typologically diverse languages via manual translation of the problems into the target languages. To the best of our knowledge, this is the first multilingual benchmark for evaluating the arithmetic reasoning abilities of language models.
We evaluate two large language models, GPT-3 (Brown et al., 2020; Ouyang et al., 2022) and PaLM (Chowdhery et al., 2022), on this benchmark. While both models solve less than 20% of the problems with standard prompting, the 540-billion-parameter PaLM model in particular shows exceptional multilingual reasoning abilities when prompted with intermediate reasoning steps (Figure 1), solving more than 40% of the problems in every investigated language, including underrepresented languages such as Bengali and Swahili. In our best setting, PaLM achieves an average solve rate of 55% across languages. We find that intermediate reasoning steps written in English consistently lead to results that are competitive with or better than those written in the native language of the question, suggesting that English chain-of-thought prompting may be a useful baseline for future multilingual reasoning work.
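As a concrete illustration of this prompting setup, the sketch below assembles a few-shot prompt with English intermediate reasoning steps for a question that may be posed in another language. It is a minimal sketch rather than our exact evaluation harness: the exemplar, the "Question"/"Step-by-Step Answer" template, and query_model are illustrative placeholders.

```python
# Minimal sketch of few-shot chain-of-thought prompting. The exemplar and
# prompt template are illustrative, and `query_model` stands in for any
# large-language-model completion API.

EXEMPLARS = [
    # (question, English intermediate reasoning steps, final answer)
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
     "Each can has 3 tennis balls. How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
     "6 tennis balls. 5 + 6 = 11.",
     "11"),
]

def build_cot_prompt(question: str) -> str:
    """Prepend worked exemplars, then ask the model to continue the pattern."""
    parts = [
        f"Question: {q}\nStep-by-Step Answer: {steps} The answer is {ans}."
        for q, steps, ans in EXEMPLARS
    ]
    # The test question may be written in any of the MGSM languages;
    # the exemplars keep the intermediate reasoning in English.
    parts.append(f"Question: {question}\nStep-by-Step Answer:")
    return "\n\n".join(parts)

# Usage (hypothetical API):
# completion = query_model(build_cot_prompt(mgsm_question))
```

Standard prompting corresponds to dropping the intermediate reasoning text from the exemplars and keeping only the final answers.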
We further demonstrate that the multilingual reasoning abilities of pretrained models extend to common-sense reasoning (Ponti et al., 2020) and word-in-context semantic judgment (Raganato et al., 2020). By presenting the models with few-shot examples in different languages, PaLM sets a new state-of-the-art performance (89.9%) on XCOPA (Ponti et al., 2020), outperforming prior approaches that require thousands of training examples.
2 THE MGSM BENCHMARK
In this section, we describe the collection process of Multilingual Grade School Math (MGSM), to
our knowledge the first multilingual arithmetic reasoning benchmark.
[Figure 2: MGSM problem distribution with respect to the number of reasoning steps in the standard solution. Of the 250 problems, 78 require two steps, 61 three, 55 four, 29 five, 13 six, 13 seven, and 1 eight.]
Source data. We used GSM8K (Cobbe et al., 2021), a human-annotated English-language dataset of grade-school math problems, as the base data source. For MGSM, we took the first 250 examples from the official GSM8K test set. Each problem requires two to eight steps to solve according to the official solution (Figure 2). The answer to each question in GSM8K is written as an Arabic numeral, which we kept consistent across all languages to facilitate cross-lingual prediction.¹
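To make this selection step concrete, the sketch below pulls the first 250 official GSM8K test problems and parses the final Arabic-numeral answer from each reference solution. It assumes the Hugging Face datasets copy of GSM8K (dataset "gsm8k", config "main"), whose solutions end with a line of the form "#### <answer>"; the helper name is ours.

```python
# Sketch of selecting the MGSM source problems from GSM8K and extracting
# the final Arabic-numeral answer. Assumes the Hugging Face `datasets`
# copy of GSM8K, whose reference solutions end with "#### <answer>".
from datasets import load_dataset

def final_answer(solution: str) -> str:
    """Return the numeral after the '####' delimiter, stripping commas."""
    return solution.split("####")[-1].strip().replace(",", "")

test_set = load_dataset("gsm8k", "main", split="test")
mgsm_source = test_set.select(range(250))  # first 250 official test examples

for example in mgsm_source.select(range(2)):
    print(example["question"][:60], "->", final_answer(example["answer"]))
```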
Target language selection. We selected a typologically diverse set of ten languages other than English (EN), spanning eight language families and different levels of representation in standard pretraining datasets such as mC4 (Xue et al., 2021): Bengali (BN), Chinese (ZH), French (FR), German (DE), Japanese (JA), Russian (RU), Spanish (ES), Swahili (SW), Telugu (TE), and Thai (TH).
¹ Certain scripts, such as Devanagari, employ different numerals. We restrict the data to Arabic numerals for consistency, but future work may investigate cross-lingual numeracy by mapping Arabic numerals to those of the corresponding script (see Spithourakis & Riedel, 2018).
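The numeral mapping alluded to in this footnote is mechanical; as a hypothetical illustration (not part of MGSM), a digit-translation table suffices for Devanagari:

```python
# Hypothetical sketch of the numeral mapping the footnote alludes to:
# converting Arabic numerals (0-9) to Devanagari digits.
DEVANAGARI_DIGITS = str.maketrans("0123456789", "०१२३४५६७८९")

def to_devanagari(text: str) -> str:
    """Replace each ASCII digit with the corresponding Devanagari digit."""
    return text.translate(DEVANAGARI_DIGITS)

assert to_devanagari("42") == "४२"
```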