Benchmarking Foundation Models with Language-Model-as-an-Examiner

Yushi Bai¹*, Jiahao Ying²*, Yixin Cao², Xin Lv¹, Yuze He¹, Xiaozhi Wang¹, Jifan Yu¹, Kaisheng Zeng¹, Yijia Xiao³, Haozhe Lyu⁴, Jiayin Zhang¹, Juanzi Li¹, Lei Hou¹✉

¹Tsinghua University, Beijing, China
²Singapore Management University, Singapore
³University of California, Los Angeles, CA, USA
⁴Beijing University of Posts and Telecommunications, Beijing, China

*Equal contribution

37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
Abstract
Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets; however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility, as various LMs can be adopted as the examiner and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise three strategies: (1) We instruct the LM examiner to generate questions across a multitude of domains to probe for breadth of knowledge, and to raise follow-up questions for a more in-depth assessment. (2) Upon evaluation, the examiner combines both scoring and ranking measurements, providing a reliable result that aligns closely with human annotations. (3) We additionally propose a decentralized Peer-examination method to address the biases of a single examiner. Our data and benchmarking results are available at: http://lmexam.xlore.cn.
1 Introduction
Recently, many large foundation models [1], such as ChatGPT [2], LLaMA [3], and PaLM [4], have emerged with impressive general intelligence and now assist billions of users worldwide. They can generate human-like responses to a wide variety of user questions. However, the answers are not always trustworthy; for example, they may contain hallucinations [5]. To understand the strengths and weaknesses of foundation models, various benchmarks have been established [6, 7, 8, 9, 10].
Nevertheless, we see two main hurdles in existing benchmarking methods, as summarized below. (1) Testing leakage. As pre-training covers ever more tasks and corpora, the answers to testing samples may already have been seen, and performance is thus over-estimated. (2) Evaluation automation. Evaluating machine-generated text is a long-standing challenge, so researchers often convert the tasks into multiple-choice problems to ease quantitative analysis. This is clearly at odds with real scenarios, where user-machine communication is mostly open-ended Question Answering (QA) or freeform QA [11]. On the other hand, because a vast number of valid "good" answers exist, it is impossible to define one or a few groundtruths, making similarity-based matching measurements (e.g., Exact Match, ROUGE-L [12], and BERTScore [13]) ineffective [11, 14, 15].
Figure 1: Overview of our benchmarking method. The left part shows the use of a language model as an examiner. The examiner generates questions from various domains, allowing it to probe for comprehensive understanding (knowledge breadth) as well as deep specialization (knowledge depth) through follow-up questions (FQs). It then scores and ranks other models' responses according to its understanding of the subject, providing a reliable evaluation. The right part presents peer-examination, a novel decentralized method that provides fairer evaluation results, at the potential cost of the higher workload of running multiple LM examiners rather than a single one.
Therefore, recent works target a well-trained evaluator language model (LM) to assess answer quality in a reference-free manner [16, 17, 18]. However, using an LM as an evaluator also presents a problem: What if the evaluator hallucinates and makes wrong judgments during assessment?
As an attempt, our pilot study utilizes GPT-4 [19] to evaluate the correctness of LLaMA [3] on Natural Questions [20], where a non-negligible 18 out of 100 judgments are incorrect (see cases in Appendix A). We attribute this mainly to the evaluator's own inadequate knowledge of the questions. A straightforward solution is to use the LM not just as an evaluator that assesses responses, but as a knowledgeable examiner that also formulates the questions, guaranteeing a thorough understanding of what it judges. This also naturally addresses the testing-leakage issue, since new questions can be generated periodically. Yet, relying on a centralized examiner can hardly be considered fair, especially when evaluating the examiner itself: a man who is his own lawyer has a fool for his client.
In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, to
assess current foundation models, mitigating the aforementioned issues. Herein, the language model
acts as a knowledgeable examiner that poses questions based on its inherent knowledge and evaluates
others on their responses. We devise three strategies to alleviate potential bias:
• Increasing Knowledge Breadth and Depth. In terms of breadth, according to a predefined taxonomy, we select as many diverse domains as possible to generate questions. In terms of depth, to probe models deeply within a specific subfield, we propose a multi-round setting where the evaluator mimics an interviewer, posing more sophisticated follow-up questions based on the interviewee model's preceding responses. We release our dataset, LMExamQA, which is constructed using GPT-4 [19] as an examiner.
• Reliable Evaluation Measurement. We explore two evaluation metrics, namely Likert scale scoring and ranking, offering a more comprehensive evaluation result. The results from both metrics correlate closely with human annotations, significantly outperforming all previous metrics.
• Peer-examination Mechanism. To avoid the potential bias arising from a single model acting as examiner, we propose a decentralized evaluation setting where all participating models are invited to be examiners and to assess each other.
In experiments, our benchmarking pipeline yields fruitful results on 8 popular foundation models. We
also demonstrate that peer-examination can generate a more diverse set of questions for knowledge
probing and balance the biases from individual evaluator models, ultimately leading to a more
equitable evaluation outcome.
2 Related Work
Benchmarks for Foundation Models. Various benchmarks have been proposed to assess foundation models on open-ended question answering, since it is the most natural setting for user-machine interaction in real scenarios. Prominent examples include MS MARCO [21], SQuAD [22, 23], Natural Questions [20], WebQuestions [24], and OpenBookQA [25]. On the other hand, only a limited number of datasets feature long-form QA. One widely recognized example is ELI5 [26], which comprises questions that necessitate lengthy descriptive and explanatory answers. A notable limitation of these benchmarks is their reliance on human curation and annotation, which inherently constrains their scalability. Our approach, by comparison, utilizes LMs to construct datasets, offering the advantage of effortless extensibility.
Automating NLG Evaluation. To evaluate machine-generated responses to questions, several automatic metrics have been adopted, including the F1 score, Exact Match (EM), BLEU [27], ROUGE [12], and METEOR [28]. However, each metric has its own shortcomings, resulting in large discrepancies between tested and actual performance [14, 29, 30].
To address these issues, well-trained LMs have been utilized for NLG evaluation [31, 32, 33, 34]. One mainstream line of previous methods is reference-based: the similarity between the candidate and a reference is derived using an LM. Prominent metrics in this class include MoverScore [35] and BERTScore [13]. These metrics measure distributional similarity rather than lexical overlap [36], making them appropriate for contexts that require more flexible generation. Recent studies [16, 17, 18, 37, 38, 39, 40, 41] have demonstrated that large language models (LLMs), such as ChatGPT [2], can conduct NLG evaluation in a reference-free manner. They can rate a candidate text (or perform a comparative assessment of two candidates) based on a specified evaluation aspect, displaying a high correlation with human assessments in tasks such as summarization and story generation [42, 43]. In these studies, the evaluations primarily focus on lexical quality aspects of a generated text, such as coherence and fluency. However, their capability to evaluate crucial aspects of a QA response, including factual correctness and information comprehensiveness, remains uncertain. Moreover, a single evaluator inevitably brings bias to the assessment [17]. Our work aims to resolve these issues by leveraging the LM not just as an evaluator but also as an examiner, assessing the performance of other models through self-generated questions, and by deploying multiple LM examiners to ensure balanced evaluation.
3 Methodology
In this section, we describe the methodology of Language-Model-as-an-Examiner, including the LMExamQA dataset construction, the evaluation metric design, and the peer-examination pipeline.
3.1 Dataset Construction
Question Generation towards Knowledge Breadth. We employ a language model (LM) as an examiner that generates diverse, high-quality questions across various domains. To ensure wide coverage of knowledge, we choose the Google Trends Categories (https://github.com/pat310/google-trends-api/wiki/Google-Trends-Categories) as the domain taxonomy and randomly select n domains from it. For each domain, we prompt the LM to generate m distinct questions. Our designed prompt (shown in Appendix B) is formulated to ensure that the generated questions possess three essential characteristics: diversified question forms, varied cognitive levels, and, most importantly, assurance that the LM has a comprehensive understanding of the knowledge surrounding the questions it poses. Figure 2 shows the distribution of question forms based on their interrogative words, as well as the distribution of question domains. According to Bloom's taxonomy [44], we divide the questions into three categories based on their required cognitive level, from low to high, namely knowledge memorization, knowledge comprehension, and knowledge analysis:
• Knowledge memorization. Questions at this level demand recognition or recollection of certain entities and attributes, such as a person, location, or time.
• Knowledge comprehension. These questions involve demonstrating an understanding of particular instances or concepts, such as "What is ...", "Why ...", and "How ...".
• Knowledge analysis. Questions of this type require more advanced cognitive skills; they typically ask about the impact, comparison, or advantages and disadvantages of a given topic.

Figure 2: Statistics of generated questions in LMExamQA. (a) Question word distribution. (b) Question domain distribution.

Level           MS [21]   SQuAD2.0 [23]   NQ [20]   ELI5 [26]   Ours   Example question in our dataset
Analysis        1%        4%              3%        0%          35%    What are the potential short and long-term impacts of divorce on children?
Comprehension   4%        13%             19%       100%        43%    How does towing capacity affect a truck's performance and what factors influence its maximum towing limit?
Memorization    95%       83%             78%       0%          22%    Which international organization publishes the World Economic Outlook report?

Table 1: Proportions of each level of questions. MS and NQ are short for MS MARCO and Natural Questions. We also list an example question from LMExamQA for each category.
By adopting GPT-4 to categorize the questions in LMExamQA and previous open-ended QA datasets into the three levels (we manually label 100 of the questions in LMExamQA and find a high agreement of 85% between human and GPT-4 annotations), we obtain the distribution over the three cognitive levels listed in Table 1, together with an example for each type of question. Compared with previous datasets, LMExamQA achieves a more balanced distribution across the three levels, thus providing a means of quantifying foundation models' proficiency at each cognitive level. Furthermore, LMExamQA includes a larger proportion of questions at higher cognitive levels, particularly the analysis level, indicating a greater level of challenge.
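For illustration, this categorization step reduces to one classification call per question. The following is a minimal sketch, not the authors' code: the prompt wording is paraphrased, and call_lm is an assumed stand-in for a GPT-4 API call.

```python
from collections import Counter
from typing import Callable

LEVELS = ("memorization", "comprehension", "analysis")

def classify_level(question: str, call_lm: Callable[[str], str]) -> str:
    """Ask the examiner LM to label one question with a Bloom-style cognitive level."""
    prompt = (
        "Classify the following question into exactly one cognitive level: "
        "memorization, comprehension, or analysis. Reply with the level only.\n"
        f"Question: {question}"
    )
    label = call_lm(prompt).strip().lower()
    return label if label in LEVELS else "comprehension"  # fallback for unparseable output

def level_distribution(questions: list[str], call_lm: Callable[[str], str]) -> dict[str, float]:
    """Proportion of questions at each level, in the spirit of Table 1."""
    counts = Counter(classify_level(q, call_lm) for q in questions)
    total = sum(counts.values())
    return {lvl: counts[lvl] / total for lvl in LEVELS}
```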
To justify the reliability of the LM examiner as an evaluator of these questions, we employ it to produce a groundtruth answer with the prompt, "Answer the questions accurately and completely, without providing additional details." Upon evaluation by human experts on a random selection of 100 questions, the answers offered by the LM exhibit a 100% accuracy rate, demonstrating mastery over the questions it generates.
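The generation loop described above can be sketched as follows. This is a minimal sketch under assumed interfaces: call_lm stands in for the GPT-4 examiner API, the question-generation prompt paraphrases the prompt in Appendix B rather than quoting it, and the JSON output format is our own convention; only the groundtruth-answer prompt is quoted from the text above.

```python
import json
import random
from typing import Callable

def build_lmexam_split(domains: list[str], n: int, m: int,
                       call_lm: Callable[[str], str]) -> list[dict]:
    """Sample n domains and generate m questions plus a reference answer for each."""
    dataset = []
    for domain in random.sample(domains, n):
        gen_prompt = (
            f"Generate {m} diverse, high-quality questions about '{domain}'. "
            "Cover different question forms and cognitive levels, and only ask about "
            "knowledge you understand thoroughly. Return a JSON list of strings."
        )
        questions = json.loads(call_lm(gen_prompt))  # assumes the LM returns valid JSON
        for q in questions[:m]:
            answer = call_lm(
                "Answer the questions accurately and completely, without providing "
                f"additional details.\nQuestion: {q}"
            )
            dataset.append({"domain": domain, "question": q, "answer": answer})
    return dataset
```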
Multi-round Follow-up Question Generation towards Knowledge Depth. To further probe a model's comprehension of a topic in depth, we develop an evaluation procedure involving multiple rounds of follow-up inquiries, drawing inspiration from the interview process. We utilize the LM examiner to construct a series of follow-up questions (the prompt is shown in Appendix B). These follow-up questions are specifically tailored to delve deeper into the concepts presented in the model-generated answers from the previous round. Since the follow-up questions depend on the model's generated answers, we only ask follow-up questions for correctly answered queries (as determined by the LM examiner) and calculate the proportion of correct responses in the subsequent round. We limit the total number of rounds to k in order to minimize the topic deviation that might occur during longer sessions. Note that we only provide the interviewee model with the follow-up question as input, rather than the full "exam history", since most models are not capable of multi-round conversation. (Our approach is thus essentially different from conversational QA [45, 46, 47], which places greater emphasis on evaluating the model's comprehension of the conversational context.) We show an example of a follow-up question to Flan-T5 [48]:
Question: Which material is primarily used to manufacture semiconductor devices?
Flan-T5: Silicon ✓
Follow-up Question: What are the advantages of using silicon as the primary material for
semiconductor devices?
Flan-T5: Silicon is a nonrenewable resource, and it is the most abundant element on Earth. ✗
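The multi-round procedure can be sketched as below. Here examiner and candidate are assumed callables wrapping the examiner LM and the model under test, the judging and follow-up prompts are paraphrases rather than the prompts in Appendix B, and k bounds the number of rounds as in the text.

```python
from typing import Callable

LM = Callable[[str], str]  # maps a prompt to a model response

def follow_up_accuracy(question: str, examiner: LM, candidate: LM, k: int = 2) -> list[bool]:
    """Run up to k rounds on one seed question; only correct answers receive a follow-up.

    Each follow-up is asked in isolation, without the earlier exchange, matching the
    single-turn setting described above. Returns a per-round correctness list.
    """
    results, current_q = [], question
    for _ in range(k):
        answer = candidate(current_q)
        verdict = examiner(
            f"Question: {current_q}\nAnswer: {answer}\n"
            "Is this answer correct? Reply yes or no."
        )
        correct = verdict.strip().lower().startswith("yes")
        results.append(correct)
        if not correct:
            break  # incorrect answers end the interview for this question
        current_q = examiner(
            f"Question: {current_q}\nAnswer: {answer}\n"
            "Pose one deeper follow-up question about a concept in this answer."
        )
    return results
```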
3.2 Evaluation Metrics
Several methodologies are commonly employed to facilitate human-like evaluation by LMs; prominent among these are Likert scale scoring [16, 17, 41] and pairwise comparison [38, 41]. For the purposes of our benchmark, we incorporate both Likert scale scoring and a variant of pairwise comparison, namely ranking.
Likert scale scoring functions as an absolute evaluative measure, where the evaluator assigns scores to a given response along predefined dimensions. We establish four distinct dimensions for our dataset: (1) Accuracy. This assesses the extent to which the provided response accurately answers the question. (2) Coherence. This evaluates the logical structure and organization of the response and the degree to which it can be comprehended by non-specialists. (3) Factuality. This examines whether the response contains factual inaccuracies. (4) Comprehensiveness. This gauges whether the response covers multiple facets of the question, thus providing a thorough answer. Each of these dimensions is scored on a scale of 1 to 3, from worst to best. We also ask the evaluator to provide an overall score ranging from 1 to 5, based on the scores assigned to the four dimensions above; this score serves as an indicator of the overall quality of the answer.
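A minimal sketch of the scoring call: the dimension names and scales follow the text above, while the examiner callable, the JSON output format, and the prompt wording are our own assumptions rather than the paper's exact prompt.

```python
import json
from typing import Callable

DIMENSIONS = ("accuracy", "coherence", "factuality", "comprehensiveness")

def likert_score(question: str, response: str, examiner: Callable[[str], str]) -> dict:
    """Ask the examiner for 1-3 scores per dimension plus a 1-5 overall score."""
    prompt = (
        "Rate the response to the question on accuracy, coherence, factuality and "
        "comprehensiveness (each 1-3, worst to best), then give an overall score (1-5) "
        "based on those four dimensions. Return JSON with keys "
        f"{list(DIMENSIONS) + ['overall']}.\n"
        f"Question: {question}\nResponse: {response}"
    )
    scores = json.loads(examiner(prompt))
    # Sanity-check the ranges before using the scores downstream.
    assert all(1 <= scores[d] <= 3 for d in DIMENSIONS) and 1 <= scores["overall"] <= 5
    return scores
```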
On the other hand, pairwise comparison operates as a relative evaluation method and is often more discerning than scoring. In this process, the evaluator is given two responses and is tasked with determining which is superior, taking into account their accuracy, coherence, factuality, and comprehensiveness. Given n contestant models, we implement a merge sort algorithm to rank the n responses, involving O(n log n) pairwise comparisons.
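Concretely, the ranking procedure is merge sort with the examiner acting as the comparator, which yields O(n log n) pairwise comparisons for n contestant responses. Below is a minimal sketch, assuming a prefer(a, b) callable that wraps the examiner's pairwise-comparison prompt.

```python
from typing import Callable, List

def rank_responses(responses: List[str], prefer: Callable[[str, str], bool]) -> List[str]:
    """Merge sort over responses; prefer(a, b) is True when the examiner judges a better than b.

    Returns the responses ordered from best to worst.
    """
    if len(responses) <= 1:
        return responses
    mid = len(responses) // 2
    left = rank_responses(responses[:mid], prefer)
    right = rank_responses(responses[mid:], prefer)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if prefer(left[i], right[j]):  # one LM pairwise comparison per step
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```

Each prefer call corresponds to one examiner judgment, so ranking the eight contestant models benchmarked in Section 4 requires at most about n log₂ n = 24 comparisons per question.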
3.3 Decentralized Evaluation: Peer-Examination
We introduce Peer-examination, a novel decentralized method that incorporates multiple models as examiners (illustrated in the right part of Figure 1), since relying on only one centralized model as the examiner introduces the following potential drawbacks to the benchmarking process. (1) Coverage of generated questions: the examiner may not have a holistic understanding of certain domain knowledge. As a result, it may struggle to propose questions that probe these areas in detail, which in turn leaves the scope of generated questions insufficient. (2) Potential bias during evaluation: the model itself may be biased during evaluation. The bias can manifest as a preference for certain types of responses or a predisposition towards attributes irrelevant to the quality of the responses, such as response length or linguistic style. For example, [17] shows that GPT-4 [19] prefers ChatGPT [2] summaries over human-written ones. Such biases may result in unfair ranking outcomes.
To mitigate these issues, during peer-examination each model is assigned the role of examiner in turn. As an examiner, a model is responsible for posing questions and evaluating the answers provided by the other models. We then combine the evaluation results from all examiners by voting to obtain the final result. This approach leverages the collective expertise and diverse perspectives of all models to improve the coverage of questions and to ensure fairer assessments.
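One simple instantiation of the voting step is positional (Borda-style) aggregation of each examiner's ranking. The text above specifies only that results are combined by voting, so the exact rule below, and the model names in the example, are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List

def combine_rankings(rankings: List[List[str]]) -> List[str]:
    """Aggregate each examiner's ranking (best first) into a consensus ranking.

    Borda-style: with n contestants, the model ranked r-th by an examiner receives
    n - r points; ties in total points are broken alphabetically.
    """
    points: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for r, model in enumerate(ranking):
            points[model] += n - r
    return sorted(points, key=lambda m: (-points[m], m))

# Example: three examiners each rank four contestant models (names are placeholders).
votes = [["vicuna", "chatgpt", "flan-t5", "llama"],
         ["chatgpt", "vicuna", "llama", "flan-t5"],
         ["chatgpt", "flan-t5", "vicuna", "llama"]]
print(combine_rankings(votes))  # ['chatgpt', 'vicuna', 'flan-t5', 'llama']
```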
4 Experiments
To demonstrate the effectiveness of our Language-Model-as-an-Examiner framework, we first employ GPT-4 [19] as the examiner for a centralized evaluation, since it exhibits a broad understanding of knowledge [9, 49, 50] and precise judgment ability [16, 17]. For peer-examination, we additionally employ Claude (Claude-instant) [51], ChatGPT [2], Bard [52], and Vicuna-13B [38] as LM examiners.