Automatic Evaluation of Attribution by Large Language Models
Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, Huan Sun
The Ohio State University
{yue.149,wang.13930,chen.8336,zhang.13253,su.809,sun.397}@osu.edu
Abstract
A recent focus of large language model (LLM) development, as exemplified by generative search engines, is to incorporate external references to generate and support its claims. However, evaluating the attribution, i.e., verifying whether the generated statement is fully supported by the cited reference, remains an open problem. Although human evaluation is common practice, it is costly and time-consuming. In this paper, we investigate automatic evaluation of attribution given by LLMs. We begin by defining different types of attribution errors, and then explore two approaches for automatic evaluation: prompting LLMs and fine-tuning smaller LMs. The fine-tuning data is repurposed from related tasks such as question answering, fact-checking, natural language inference, and summarization. We manually curate a set of test examples covering 12 domains from a generative search engine, New Bing. Our results on this curated test set and simulated examples from existing benchmarks highlight both promising signals and challenges. We hope our problem formulation, testbeds, and findings will help lay the foundation for future studies on this important problem. (Our code and dataset are available at https://github.com/OSU-NLP-Group/AttrScore.)
1 Introduction
Generative large language models (LLMs) (Brown et al., 2020; Ouyang et al., 2022; Chowdhery et al., 2022; OpenAI, 2023a,b, inter alia) often struggle with producing factually accurate statements, resulting in hallucinations (Ji et al., 2023). Recent efforts aim to alleviate this issue by augmenting LLMs with external tools (Schick et al., 2023) such as retrievers (Shuster et al., 2021; Borgeaud et al., 2022) and search engines (Nakano et al., 2021; Thoppilan et al., 2022; Shuster et al., 2022).
Incorporating external references for generation inherently implies that the generated statement is backed by these references. However, the validity of such attribution, i.e., whether the generated statement is fully supported by the cited reference, remains questionable. (Attribution primarily refers to "the act of attributing something" in this paper, which is similar to "verifiability" as defined in Liu et al. (2023).) According to Liu et al. (2023), only 52% of the statements generated by state-of-the-art generative search engines such as New Bing (www.bing.com/new) and PerplexityAI (www.perplexity.ai) are fully supported by their respective cited references.
Inaccurate attribution compromises the trustworthiness of LLMs, introducing significant safety risks and potential harm. For instance, in healthcare, an LLM might attribute incorrect medical advice to a credible source, potentially leading users to make harmful health decisions. Similarly, in finance, faulty investment advice attributed to a reliable source may cause substantial financial losses.
To identify attribution errors, existing attributed LLMs (Nakano et al., 2021; Thoppilan et al., 2022) rely heavily on human evaluation, which is both expensive and time-consuming. For instance, the average cost of annotating a single (query, answer, reference) example is about $1 in Liu et al. (2023). In the actual use of attributed LLMs, it is the user who needs to be wary of the attribution and manually verify it, which puts a tremendous burden on their side. Therefore, effective and reliable methods to automatically evaluate attribution and identify potential attribution errors are highly desired.
Towards this goal, we take the first step by introducing AttrScore (Figure 1), a framework designed for automatic evaluation of attribution and identification of specific types of attribution errors. We propose a new problem formulation that categorizes attribution into three types: 1) attributable: the reference fully supports the generated statement; 2) extrapolatory: the reference lacks sufficient information to support the generated statement, and 3) contradictory: the generated statement directly contradicts the cited reference. Unlike existing work (Bohnet et al., 2022) that uses binary categorization (i.e., attributable or not) and Liu et al. (2023) that defines the degree of reference support for the generated statement as "full", "partial", or "no support", our fine-grained error categorization aids humans in better understanding the type of an attribution error made by an LLM. This not only enhances safe system usage but also provides valuable insights for future development of mechanisms tailored to correct specific errors.

Figure 1: We make the first step towards automatically evaluating attribution and identifying specific types of errors with AttrScore. We explore two approaches in AttrScore: (1) prompting LLMs, and (2) fine-tuning LMs on simulated and repurposed datasets from related tasks.
We explore two approaches in AttrScore: 1) prompting LLMs and 2) fine-tuning LMs on simulated and repurposed data from related tasks such as question answering (QA), fact-checking, natural language inference (NLI), and summarization. For evaluation, unlike existing work (Liu et al., 2023; Gao et al., 2023) that only uses queries from existing benchmarks, we curate a set of test examples covering 12 different domains from a generative search engine, New Bing. This is the first evaluation set for measuring the attribution of LLMs with queries created based on real-life interactions, hence avoiding the data contamination issue.
Our results indicate that both approaches show reasonable performance on our curated and simulated test sets; yet there is still substantial room for further improvement. Major sources of evaluation failures include insensitivity to fine-grained information comparisons, such as overlooking contextual cues in the reference, disregard for numerical values, and failure in performing symbolic operations. In light of these findings, we discuss potential directions for improving AttrScore, including training models to be more strongly conditioned on the reference, and augmenting them with external tools for numerical and logical operations.
With the new formulation of attribution errors, the development of AttrScore, the introduction of new test sets, and the insights into challenges and potential directions for future work, we hope our work can help lay the foundation for the important task of automatically evaluating LLM attributions.
2 Problem Formulation
The primary task in this paper is to evaluate attribution, which involves verifying whether a reference provides sufficient support for a generated answer to a user's query. Our task setting prioritizes one reference per statement, a unit task that more complex scenarios can be decomposed into. We study such a setting as it forms the basis for dealing with multiple references or distinct segments (Liu et al., 2023; Gao et al., 2023).
Prior work, such as Rashkin et al. (2021); Gao et al. (2022); Bohnet et al. (2022), mainly focuses on binary verification, i.e., determining if a reference supports the generated answer or not. We propose advancing this task by introducing a more fine-grained categorization. Specifically, we classify attributions into three distinct categories:
• Attributable: The reference fully supports the generated answer.
• Extrapolatory: The reference lacks sufficient information to validate the generated answer.
• Contradictory: The generated answer contradicts the information presented in the reference.

(While these categories are generally mutually exclusive, complex scenarios might blur the boundaries between them. Such cases are very rare, and for the purpose of this study we maintain their exclusivity to enable clear and focused error analysis.)
To illustrate, consider a contradictory example (Figure 1). The query is "What was the unemployment rate in Germany in 2020?", and the generated answer is "4.31%". However, the reference states that the rate was "3.81%", contradicting the generated answer. An extrapolatory instance, on the other hand, would be a query about the "gas price in California". While the reference is relevant, it does not contain specific information to verify the correctness of the generated answer.
Following these examples, we see the importance of granularity in error classification. A fine-grained classification allows us to pinpoint the nature of the errors, be it contradiction or extrapolation. Users can better understand the types of errors an LLM might make, enabling them to use the model more safely. Additionally, such an error identification system can guide future training of attributed LLMs, informing the development of mechanisms that correct specific errors.
Our categorization also offers a departure from the existing approach (Liu et al., 2023), which emphasizes the degree of support ("full", "partial", or "none") rather than attribution error types. Our approach highlights specific issues in attribution evaluation for more effective error management and system improvement.
Formally, the task of attribution evaluation involves a natural language query q, a generated answer a, and a reference x from an attributed LLM. The goal is to develop a function f that takes (q, a, x) as input and outputs a class label indicating whether, according to x, the answer a to the query q is attributable, extrapolatory, or contradictory. (It is important to note that this evaluation focuses on the "verifiability" of the answer based on the reference. It does not measure "relevance", i.e., whether the answer correctly responds to the query (Liu et al., 2023).)
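To make the interface concrete, here is a minimal sketch of the task in Python; the label names follow the three categories above, and the evaluator type is simply an assumed signature for any model implementing f, not code from the paper.

```python
# Minimal task interface for attribution evaluation (an illustrative sketch).
from enum import Enum
from typing import Callable

class AttributionLabel(str, Enum):
    ATTRIBUTABLE = "attributable"    # the reference fully supports the answer
    EXTRAPOLATORY = "extrapolatory"  # the reference lacks sufficient information
    CONTRADICTORY = "contradictory"  # the answer contradicts the reference

# f maps a (query, answer, reference) triple to one of the three labels.
AttributionEvaluator = Callable[[str, str, str], AttributionLabel]
```

Both approaches in Section 3 can then be viewed as different implementations of this evaluator type.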
3 Automatic Evaluation of Attribution
Following our problem definition, we introduce two approaches for automatic evaluation of attribution: prompting LLMs and fine-tuning LMs on simulated and repurposed data from related tasks.
3.1 Prompting LLMs
Recent research (Fu et al., 2023) has demonstrated the possibility of prompting LLMs to evaluate the quality of generated text using their emergent capabilities (Wei et al., 2022b), such as zero-shot instruction (Wei et al., 2022a) and in-context learning (Brown et al., 2020). Following this approach, we prompt LLMs, such as ChatGPT (OpenAI, 2023a), using a clear instruction that includes definitions of the two types of errors (as shown in Figure 1) and an input triple of the query, answer, and reference for evaluation. The complete prompt used in our study can be found in Appendix Table 6.
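As a rough illustration of this setup, the sketch below builds an instruction similar to the one shown in Figure 1 and sends one (query, answer, reference) triple to an OpenAI chat model. The instruction wording and the helper name are assumptions for illustration; the paper's exact prompt is in its Appendix Table 6.

```python
# A sketch of zero-shot prompting for attribution evaluation. The instruction
# paraphrases Figure 1, not the paper's exact prompt. Assumes the openai
# Python package (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

INSTRUCTION = (
    "As an Attribution Validator, your task is to verify whether a given reference "
    "can support the given answer to a question. A contradictory error means the "
    "answer contradicts the facts in the reference, while an extrapolatory error "
    "means the reference does not contain enough information to verify the answer. "
    "Respond with exactly one label: Attributable, Extrapolatory, or Contradictory."
)

def evaluate_attribution(query: str, answer: str, reference: str,
                         model: str = "gpt-3.5-turbo") -> str:
    """Classify one (query, answer, reference) triple with a chat LLM."""
    user_msg = f"Question: {query}\nAnswer: {answer}\nReference: {reference}\nLabel:"
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic decoding, matching the paper's setting
        messages=[
            {"role": "system", "content": INSTRUCTION},
            {"role": "user", "content": user_msg},
        ],
    )
    return resp.choices[0].message.content.strip()
```

In practice, the returned string would still need light post-processing (e.g., mapping off-label responses to one of the three classes) before computing accuracy.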
3.2 Fine-tuning LMs on Repurposed Data
The primary challenge in fine-tuning LMs for automatic attribution evaluation is the lack of training data. One potential approach is to hire annotators to collect real samples, but the cost can be prohibitive. Here, we first repurpose datasets from three related tasks (fact-checking, NLI, and summarization). We then propose to further simulate more realistic samples from existing QA benchmarks.
Repurpose data from fact-checking, NLI, and summarization tasks. Given the connections between our attribution evaluation task and the tasks of fact-checking, NLI, and summarization, we propose to utilize datasets from these fields to enrich our training examples. Fact-checking and NLI data, with their emphasis on assessing the consistency and logical relationship between claims (hypotheses) and evidence (premises), mirror our task's objective of checking the supporting relationship between reference documents and generated statements. Summarization datasets, especially those involving the detection of hallucinations, both intrinsic and extrinsic (Maynez et al., 2020), could provide a useful starting point for identifying attribution inconsistencies. Nevertheless, these datasets require suitable adaptation: we keep their original data sequences and modify their label space to suit the specific needs of the attribution evaluation definition. Additional information on this can be found in Appendix A.
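As a concrete illustration of this relabeling, the sketch below shows one plausible mapping from common NLI and fact-checking label spaces onto the three attribution categories. The mapping and field names are assumptions made for illustration; the authors' exact adaptation is described in their Appendix A.

```python
# Plausible relabeling of NLI / fact-checking examples into attribution
# examples (illustrative; the paper's exact mapping is in its Appendix A).
LABEL_MAP = {
    # NLI: premise acts as the reference, hypothesis as the generated statement.
    "entailment": "Attributable",
    "neutral": "Extrapolatory",
    "contradiction": "Contradictory",
    # Fact-checking (e.g., FEVER-style labels): evidence vs. claim.
    "SUPPORTS": "Attributable",
    "NOT ENOUGH INFO": "Extrapolatory",
    "REFUTES": "Contradictory",
}

def repurpose(example: dict, claim_key: str, evidence_key: str, label_key: str) -> dict:
    """Convert one source-task example into an attribution-evaluation example."""
    return {
        "query": "",                          # repurposed examples carry no user query
        "answer": example[claim_key],         # claim / hypothesis
        "reference": example[evidence_key],   # evidence / premise
        "label": LABEL_MAP[example[label_key]],
    }
```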
Simulate data from open-domain QA. QA benchmarks provide an ideal platform for data simulation, as they comprise questions, their corresponding ground-truth answers, and reference contexts. These elements can be directly employed as attributable examples (Figure 2, A). In open-domain QA datasets, answers are typically brief text spans. To cater to the long-answer setting in most attributed LLMs, we convert these short answers into longer sentences using ChatGPT. For simulating contradictory errors, we propose two methods: (1) the first modifies the correct answer with an alternative candidate from an off-the-shelf QA model, an answer substitution model, or a random span generator (Figure 2, B); (2) the second retains the original answer but replaces the answer span in the reference context with a comparable candidate (Figure 2, C). To emulate extrapolatory errors, we employ a BM25 retriever on the question, retrieving relevant external documents from resources such as Wikipedia that do not contain the ground-truth answers (Figure 2, D). More details regarding the simulation of these errors from QA datasets can be found in Appendix A.

Figure 2: Examples simulated from open-domain QA. We 1) use the original (question, answer, context) triple as an attributable instance (A), 2) substitute the answer or the answer span in the context to simulate a contradictory error example (B, C), and 3) replace the context with alternatives to simulate an extrapolatory error example (D). In order for models trained on the simulated data to generalize well to the long-answer setting in real-life search engines like New Bing, we convert the short answer to a long one (using ChatGPT).
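The simulation logic can be summarized with the following sketch; substitute() and retrieve_negative() are hypothetical stand-ins for the answer-substitution models and the BM25 retriever mentioned above, so this is an outline of the procedure rather than the paper's actual pipeline.

```python
# Illustrative simulation of attribution-evaluation examples from a QA triple
# (question, answer, context). The substitute() and retrieve_negative() helpers
# are assumed placeholders for off-the-shelf components, not the paper's code.

def make_attributable(question, answer, context):
    return {"query": question, "answer": answer, "reference": context,
            "label": "Attributable"}

def make_contradictory_answer(question, answer, context, substitute):
    # Method (1): swap the gold answer for a comparable but wrong candidate.
    wrong = substitute(answer, context)
    return {"query": question, "answer": wrong, "reference": context,
            "label": "Contradictory"}

def make_contradictory_context(question, answer, context, substitute):
    # Method (2): keep the answer, but replace the answer span in the reference.
    wrong = substitute(answer, context)
    return {"query": question, "answer": answer,
            "reference": context.replace(answer, wrong),
            "label": "Contradictory"}

def make_extrapolatory(question, answer, retrieve_negative):
    # Retrieve a topically relevant passage that does not contain the gold answer.
    negative = retrieve_negative(question, exclude=answer)
    return {"query": question, "answer": answer, "reference": negative,
            "label": "Extrapolatory"}
```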
4 Experimental Setup
4.1 Datasets
This section presents the datasets utilized for training and testing methods for automatic attribution evaluation. In particular, we develop two evaluation sets, AttrEval-Simulation and AttrEval-GenSearch, derived from existing QA datasets and a generative search engine, respectively. The dataset statistics are presented in Table 1.
Training data. To repurpose and simulate training examples, we follow the method in Section 3.2 based on four similar tasks' datasets. For QA, we consider NaturalQuestions (Kwiatkowski et al., 2019). For fact-checking, we include FEVER (Thorne et al., 2018), Adversarial FEVER (Thorne et al., 2019), FEVEROUS (Aly et al., 2021), VITAMINC (Schuster et al., 2021), MultiFC (Augenstein et al., 2019), PubHealth (Kotonya and Toni, 2020), and SciFact (Wadden et al., 2020). For NLI, we include SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018), ANLI (Nie et al., 2020), and SciTail (Khot et al., 2018). For summarization, we include XSum-Halluc. (Maynez et al., 2020), XENT (Cao et al., 2022), and FactCC (Kryscinski et al., 2020). We use all examples in the summarization task datasets, and sample 20K examples from QA, fact-checking, and NLI task datasets. We combine all the simulated datasets to create the training set for our main experiment.

Split | Related Task  | Data Sources                                                              | #Samples
Train | QA            | NaturalQuestions                                                          | 20K
Train | Fact-checking | FEVER, VITAMINC, Adversarial FEVER, FEVEROUS, SciFact, PubHealth, MultiFC | 20K
Train | NLI           | SNLI, MultiNLI, ANLI, SciTail                                             | 20K
Train | Summarization | XSum-Hallucinations, XENT, FactCC                                         | 3.8K
Test  | QA            | PopQA, EntityQuestions, HotpotQA, TriviaQA, WebQuestions, TREC            | 4K
Test  | -             | Annotated samples from a generative search engine                        | 242

Table 1: Statistics of the training and test datasets for attribution evaluation. We include the distributions of the labels and data sources in Appendix B.
AttrEval-Simulation. For testing, we first simulate examples from six out-of-domain QA datasets: HotpotQA (Yang et al., 2018), EntityQuestions (Sciavolino et al., 2021), PopQA (Mallen et al., 2022), TREC (Baudis and Sedivý, 2015), TriviaQA (Joshi et al., 2017), and WebQuestions (Berant et al., 2013). Note that we intentionally use different QA datasets for training and testing, so as to test the model's generalization ability and evaluate its performance across a diverse set of domains and question formats. Our manual examination indicates that 84% of 50 randomly sampled examples accurately align with their category; the labeling errors are primarily due to incorrect annotations in the original QA datasets or the heuristics used to formulate comparable answer candidates for contradictory errors and to retrieve negative passages for extrapolatory errors.
AttrEval-GenSearch. To examine the real-life application of automatic attribution evaluation, approximately 250 examples from the New Bing search engine are carefully annotated by the authors. This process comprises two subtasks: creating queries and verifying attributions. To avoid the issue of training data contamination, new queries are manually created across 12 domains (Figure 3); the "AI/NLP Research" domain is inspired by recent discussions on social media about testing LLMs' knowledge of researchers, e.g., "Is XX a co-author of the paper XX?". To facilitate and motivate query annotation, keywords from a specific domain are randomly generated using ChatGPT, and relevant facts within that domain are compiled from the Web. We make an effort to collect new facts post-2021 to test for "knowledge conflict" (Zhou et al., 2023; Xie et al., 2023) between parametric and external knowledge.

Figure 3: Domain distribution of our annotated AttrEval-GenSearch test set (covering 12 domains in total): AI/NLP Research 24.3%, Econ. & Fin. 18.4%, History 8.2%, Game & Anime 8.2%, Daily Life 7.1%, Med & Health 6.7%, Science 6.7%, Business 5.9%, Pet & Animal 5.5%, Law 5.1%, Geography 2.7%, Math 1.2%.
In the verification process, queries are sent to the New Bing search engine under a balanced mode following Liu et al. (2023), which balances accuracy and creativity. The validity of the output generated by New Bing is evaluated, where we consider only the first sentence that answers the question along with its reference. As we state in Section 2, our evaluation emphasizes the error type in a single reference per statement. In the case of a sentence having multiple references or distinct segments (for example, "XXX [1][2]" or "XXX [1] and YYY [2]"), each reference or segment is treated as a separate sample, and the attributions are verified individually. Finally, the samples are categorized by the annotators as attributable, contradictory, or extrapolatory. Detailed annotation guidelines can be found in Appendix D.
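To make the multi-reference handling concrete, here is a small illustrative helper (not from the paper) that splits a generated sentence with citation markers into per-reference samples, following the convention described above.

```python
# Split a sentence like "XXX [1][2]" or "XXX [1] and YYY [2]" so that each cited
# reference yields its own (statement, reference_id) sample. Illustrative only.
import re

def split_attribution_samples(sentence: str) -> list[tuple[str, int]]:
    """Return one (statement_text, reference_id) pair per citation marker."""
    samples = []
    # Each segment is a stretch of text ending with one or more [n] markers.
    for segment in re.findall(r"(.+?(?:\[\d+\])+)", sentence):
        text = re.sub(r"\[\d+\]", "", segment).strip(" ,;")
        for ref_id in re.findall(r"\[(\d+)\]", segment):
            samples.append((text, int(ref_id)))
    return samples

print(split_attribution_samples("The rate was 3.81% [1] and rose in 2020 [2]."))
# -> [('The rate was 3.81%', 1), ('and rose in 2020', 2)]
```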
4.2 Implementation Details
In the configuration of "prompting LLMs", we test Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), ChatGPT (OpenAI, 2023a), and GPT-4 (OpenAI, 2023b). We use OpenAI's official APIs (gpt-3.5-turbo and gpt-4-0314; platform.openai.com/docs/api-reference/chat) and obtain the Alpaca and Vicuna weights from their official repositories (https://github.com/tatsu-lab/stanford_alpaca, https://github.com/lm-sys/FastChat). Given GPT-4's high cost and slow inference speed, we evaluate it on 500 random samples from AttrEval-Simulation. For Alpaca and Vicuna inference, documents are tokenized and truncated at a maximum of 2048 tokens. We generate text with a temperature of 0. The prompts for the task of evaluating attribution are provided in Appendix Table 6.
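For the open-weight models, the setup above amounts to truncated greedy decoding; a minimal sketch is shown below, where the checkpoint name and prompt handling are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the open-weight LM setup: truncate the prompt to fit a 2048-token
# context and decode greedily (the practical equivalent of "temperature 0").
# The model checkpoint is an assumed example, not necessarily the one used.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.3"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def classify(prompt: str, max_input_tokens: int = 2048) -> str:
    # Tokenize and truncate the (instruction + query + answer + reference) prompt.
    inputs = tokenizer(prompt, return_tensors="pt",
                       truncation=True, max_length=max_input_tokens).to(model.device)
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)  # greedy decoding
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True).strip()
```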
[The remaining 20 pages of the paper are not included in this excerpt.]