Overview: This article introduces ARES, an automated evaluation framework for retrieval-augmented generation (RAG) systems. ARES automatically and precisely evaluates a RAG system's context relevance, answer faithfulness, and answer relevance. By training lightweight LLM judges on synthetic data and combining them with prediction-powered inference (PPI) to reduce prediction error, ARES achieves higher accuracy than existing frameworks across a variety of tasks. Experiments show that ARES substantially outperforms popular automated evaluation tools such as RAGAS on multiple metrics and remains accurate even under cross-domain shifts.
Intended audience: researchers and developers in natural language processing, and practitioners building applications on top of RAG systems.
Use cases and goals: providing an effective evaluation method for building high-quality RAG systems, helping researchers optimize model configurations and select the best designs; applicable to any application that relies on RAG, such as question answering, fact checking, and customer support.
Additional notes: the paper details ARES's workflow and technical specifics and discusses directions for future improvement; the authors release the ARES datasets and code so the research community can conduct further study and validation.
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
Jon Saad-Falcon
Stanford University
jonsaadfalcon@stanford.edu
Omar Khattab
Stanford University
okhattab@stanford.edu
Christopher Potts
Stanford University
cgpotts@stanford.edu
Matei Zaharia
UC Berkeley and Databricks
matei@databricks.com
Abstract
Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. Using synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across six different knowledge-intensive tasks in KILT and SuperGLUE, ARES accurately evaluates RAG systems while using a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our datasets and code for replication and deployment available at https://github.com/stanford-futuredata/ARES.
1 Introduction
Retrieval-augmented generation (RAG) has become a prominent approach for building user-facing NLP applications, such as systems for question answering (QA), fact-checking, and customer support (Petroni et al., 2021; Wang et al., 2019). Typically, a RAG system consists of a retriever and a downstream language model. Given a user question, the retriever finds relevant passages from a corpus (e.g., a company's internal knowledge base) and the language model uses these passages to generate a response. This formulation admits a multitude of choices: what retrieval model to use, how to divide the documents into retrieval chunks, and how to prompt or finetune the language model to use the retrieved information, to name only a few of the simplest design decisions.
The best design for a RAG system is not necessarily universal across data domains, corpus sizes, and cost/latency budgets. To tune their own RAG systems, practitioners traditionally need hand annotations for test questions, passages to retrieve (to assess the retriever), and responses to generate, labeled specifically for their target domain. Alternatively, they may evaluate different approaches in production by collecting human preferences that compare the candidate systems. Unfortunately, both of these strategies demand high expertise and impose considerable annotation costs.
Model-based evaluation has emerged as a cheap and automatic strategy for testing generative output quality (Zheng et al., 2023). For instance, the open-source RAGAS framework (James and Es, 2023) prompts a language model to evaluate the relevance of retrieved information and the faithfulness and accuracy of generated responses. Unfortunately, such strategies currently rely on a fixed set of heuristically hand-written prompts for evaluation, offering little adaptability to various evaluation contexts.
To evaluate RAG systems rapidly and accurately, we propose ARES, the Automated RAG Evaluation System. ARES is the first automated RAG evaluation system to generate tailored LLM judges for each component of a RAG pipeline, leading to substantial boosts in evaluation precision and accuracy compared to existing approaches like RAGAS. Furthermore, unlike existing RAG evaluation systems, ARES provides statistical guarantees for its predictions by leveraging prediction-powered inference (PPI), generating confidence intervals for its scoring. Given a particular corpus of documents and a RAG system, ARES reports three evaluation scores: context relevance (i.e., is the retrieved information pertinent to the test question), answer faithfulness (i.e., is the response generated by the language model properly grounded in the retrieved context), and answer relevance (i.e., is the response also relevant to the question). A good RAG system finds relevant contexts and generates answers that are both faithful and relevant.
Many existing RAG evaluation frameworks require substantial human annotations for scoring. ARES significantly improves data efficiency during evaluation by requiring only three inputs: a set of passages from the target corpus, a human preference validation set of 150 annotated datapoints or more, and five few-shot examples of in-domain queries and answers, which are used for prompting LLMs in synthetic data generation. Given the corpus of in-domain passages, ARES proceeds in three stages. First, it leverages a language model to construct a synthetic dataset of question–answer pairs, derived from the passages in the corpus. Second, ARES defines three separate judge models to perform three classification tasks (context relevance, answer faithfulness, and answer relevance). These judges are lightweight models fine-tuned against a contrastive learning objective. Third, ARES ranks the different RAG systems being assessed using prediction-powered inference (PPI; Angelopoulos et al. 2023) to improve model-based evaluation accuracy and provide statistical confidence intervals for RAG scoring. PPI utilizes a small set of human-annotated datapoints for computing its confidence intervals; we designate this annotated set as our human preference validation set, which is composed of 150 annotated datapoints or more that designate both positive and negative examples for context relevance, answer faithfulness, and answer relevance.
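
For intuition, PPI's mean-estimation form (Angelopoulos et al., 2023) can be written as follows; the notation here is ours and is only meant to illustrate how the human preference validation set enters the estimate:

\hat{\theta}_{\mathrm{PP}} = \frac{1}{N}\sum_{i=1}^{N} f(\tilde{x}_i) + \frac{1}{n}\sum_{j=1}^{n}\bigl(y_j - f(x_j)\bigr), \qquad \text{CI: } \hat{\theta}_{\mathrm{PP}} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}_f^2}{N} + \frac{\hat{\sigma}_{\Delta}^2}{n}},

where f is a fine-tuned LLM judge, \tilde{x}_1, \dots, \tilde{x}_N are the judge-scored (unlabeled) RAG outputs, (x_j, y_j) are the n (roughly 150 or more) human-annotated validation datapoints, and \hat{\sigma}_f^2 and \hat{\sigma}_{\Delta}^2 are the empirical variances of the judge predictions and of the prediction errors y_j - f(x_j). The first term is the judge's average score on the large unlabeled sample; the second term rectifies the judge's bias using its errors on the small labeled set.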
This work makes the following contributions. First, we propose the ARES framework for evaluating the context relevance, answer faithfulness, and answer relevance of RAG systems using only the corpus passages and a human preference validation set of 150 datapoints or more. Second, we offer a novel development pipeline for fine-tuning lightweight LLMs on synthetically generated queries and answers. We further bolster our lightweight LLM judges by using prediction-powered inference (PPI) and human annotations to provide statistical guarantees to ARES scoring of RAG systems. Third, we conduct extensive empirical evaluations, demonstrating that ARES accurately scores RAG systems across the six knowledge-intensive datasets in KILT and SuperGLUE, beating existing automated evaluation approaches like RAGAS by 59.29 and 14.4 percentage points on average across context relevance and answer relevance evaluation accuracy, respectively. We also find that ARES consistently distinguishes competitive RAG systems that are only a few points apart in ground-truth metrics. This precision enables ARES to guide the development and comparison of competitive approaches and configurations. We provide the datasets and code for replicating and deploying ARES on GitHub.
2 Related Work
Retrieval-augmented generation (RAG; Guu et al. 2020; Lewis et al. 2020; Khattab et al. 2021; Izacard et al. 2022) is now a common strategy for bolstering LLMs by combining them with retrieval systems. Through retrieval, RAG helps LM systems gather domain-specific knowledge, ground generations in factual information (Shuster et al., 2021; Huo et al., 2023), and offer a degree of transparency or interpretability via citing sources (Mialon et al., 2023).
LLM-based evaluation techniques have emerged for gauging LLM systems. This is essential for rapid deployment in new settings, where it is impractical to build a traditional benchmark dataset from scratch. Early attempts at this use LLMs out of the box, as in MT-Bench and Chatbot Arena (Zheng et al., 2023). AutoCalibrate (Liu et al., 2023b) seeks to align an LLM judge with human preferences, leveraging a self-refinement prompt to iteratively improve the LLM judge. Other work has used LLM prompting to evaluate system quality across natural language generation tasks, such as translation, summarization, and dialogue generation (Kocmi and Federmann, 2023; Fu et al., 2023; Liu et al., 2023a; Wang et al., 2023). In the context of knowledge-intensive NLP tasks, LLMs have been explored for assessing attribution and factuality in LLMs (Min et al., 2023; Gekhman et al., 2023; Yue et al., 2023). New guidelines like LongEval (Krishna et al., 2023) and datasets like Hagrid and ALCE (Kamalloo et al., 2023; Gao et al., 2023) provide resources for analyzing and evaluating knowledge-intensive LLM pipelines.
The two projects most closely related to our work are EXAM (Sander and Dietz, 2021) and RAGAS (James and Es, 2023). To evaluate RAG systems, the EXAM metric estimates how many exam questions a reader (simulated as a QA system) can answer correctly based on the generated response. The EXAM metric requires a set of queries with several associated sub-questions each, which adds a substantial burden that ARES does not require. RAGAS is a recent evaluation framework based on a handful of simple hand-written prompts. These heuristic prompts offer little adaptability to new RAG evaluation settings (e.g., new corpora) and, as we show in our evaluation, substantially underperform ARES.
3 ARES
The ARES evaluation framework proceeds in three stages, which we illustrate in Figure 1. ARES requires three inputs for the pipeline: a set of passages from the target corpus, a human preference validation set of 150 annotated datapoints or more, and five few-shot examples of in-domain queries and answers, which are used for prompting LLMs in synthetic data generation. With our inputs prepared, we begin by generating synthetic queries (and their answers) from the passages in the target corpus. We then use these query–passage–answer triples to train our LLM judges (e.g., for detecting answer relevance). Subsequently, we apply these judges to any RAG system, scoring a sample of its in-domain query–document–answer triples, and use prediction-powered inference (PPI) with our human preference validation set to reliably estimate a confidence interval for the quality of each RAG system.
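
To make the PPI step concrete, the following is a minimal sketch (our simplification for a single proportion-style score; the function name, toy data, and normal-approximation interval are illustrative assumptions, not the exact ARES implementation). The judge's scores on the large unlabeled sample are debiased by its average error on the small human-labeled validation set, and the variance of both terms feeds the confidence interval:

import numpy as np
from scipy import stats

def ppi_confidence_interval(judge_unlabeled, judge_labeled, human_labels, alpha=0.05):
    """Prediction-powered estimate of a mean score with a (1 - alpha) confidence interval.

    judge_unlabeled: judge predictions (0/1 or probabilities) on the large unlabeled sample.
    judge_labeled:   judge predictions on the small human-annotated validation set.
    human_labels:    human 0/1 annotations aligned with judge_labeled.
    """
    judge_unlabeled = np.asarray(judge_unlabeled, dtype=float)
    errors = np.asarray(human_labels, dtype=float) - np.asarray(judge_labeled, dtype=float)

    # Rectified point estimate: judge mean on unlabeled data, debiased by mean error on labeled data.
    theta_pp = judge_unlabeled.mean() + errors.mean()

    # Normal-approximation confidence interval combining both sources of variance.
    var = judge_unlabeled.var(ddof=1) / len(judge_unlabeled) + errors.var(ddof=1) / len(errors)
    half_width = stats.norm.ppf(1 - alpha / 2) * np.sqrt(var)
    return theta_pp, (theta_pp - half_width, theta_pp + half_width)

# Toy usage: 2,000 judge-scored RAG outputs, 150 human-labeled validation datapoints.
rng = np.random.default_rng(0)
unlabeled_preds = rng.binomial(1, 0.72, size=2000)   # judge marks ~72% of contexts as relevant
labeled_preds = rng.binomial(1, 0.72, size=150)
labels = np.where(rng.random(150) < 0.05, 1 - labeled_preds, labeled_preds)  # judge wrong ~5% of the time
point, (lo, hi) = ppi_confidence_interval(unlabeled_preds, labeled_preds, labels)
print(f"context relevance estimate {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")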
3.1 LLM Generation of Synthetic Dataset
We generate synthetic queries and answers from the corpus passages using generative LLMs. The generated data represent both positive and negative examples of query–passage–answer triples (e.g., relevant/irrelevant passages and correct/incorrect answers). For generation, the LLM uses our input set of few-shot examples with in-domain passages mapped to in-domain queries and answers; the model then generates a synthetic question and answer from a given in-domain passage, allowing us to create both positive and negative training examples. We include example prompts for generating synthetic queries and answers in A.5.
For creating our synthetic data, we primarily rely on FLAN-T5 XXL (discussed in subsection 4.1). ARES works well with this model (see section 5), but our system can ultimately use another high-quality model for generating synthetic queries and answers. To filter out low-quality queries, we verify that a given query can retrieve its original passage as the top result using its retriever system. This filtering approach has been used in previous work to isolate high-quality synthetic queries (Dai et al., 2022; Saad-Falcon et al., 2023).
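
As an illustration of this filtering step, here is a minimal sketch that uses the rank_bm25 package as a stand-in retriever (an assumption for illustration; in practice the retriever of the evaluated RAG system plays this role). A synthetic query is kept only if it ranks its source passage first:

from rank_bm25 import BM25Okapi

# Toy passage collection standing in for the target corpus.
corpus = [
    "ARES evaluates RAG systems with lightweight LM judges.",
    "BM25 is a classical lexical ranking function.",
    "FLAN-T5 is an instruction-tuned language model.",
]
tokenized_corpus = [passage.lower().split() for passage in corpus]
bm25 = BM25Okapi(tokenized_corpus)

def keep_synthetic_query(query: str, source_idx: int) -> bool:
    """Keep the synthetic query only if its source passage is retrieved at rank 1."""
    scores = bm25.get_scores(query.lower().split())
    return int(scores.argmax()) == source_idx

print(keep_synthetic_query("classical lexical ranking function", source_idx=1))  # True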
To generate negatives for fine-tuning our LLM judges, we rely on two novel strategies, generating the same number of negatives with each strategy:

1. Weak Negative Generation: For context relevance negatives, we randomly sample in-domain passages unrelated to a given synthetic query. For answer faithfulness and answer relevance negatives, we randomly sample synthetically-generated answers from other passages, which were created using FLAN-T5 XXL.

2. Strong Negative Generation: For context relevance negatives, we randomly sample in-domain passages from the same document as the gold passage. For datasets in which multiple passages are not available for the same document, we use BM25 to retrieve the top-10 passages similar to the passage and sample from them for our context relevance strong negatives. For answer faithfulness and answer relevance negatives, we prompt FLAN-T5 XXL (discussed in Section 4.1) to generate a contradictory answer using the few-shot prompt in Section A.4.

In total, the number of negatives generated equals the number of positives generated for evaluating context relevance and answer relevance in RAG systems.
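
A simplified sketch of the two context-relevance mining strategies above (all data structures and the "different document" proxy for "unrelated to the query" are our illustrative placeholders; answer-level negatives via sampled or contradictory answers are omitted):

import random
from typing import Optional

random.seed(0)

# Toy in-domain corpus: passage id -> (document id, text). Illustrative data only.
passages = {
    "p1": ("doc_a", "ARES fine-tunes lightweight LM judges on synthetic data."),
    "p2": ("doc_a", "PPI corrects judge predictions with a small labeled set."),
    "p3": ("doc_b", "BM25 ranks passages by lexical overlap with the query."),
}

def weak_context_negative(gold_pid: str) -> str:
    """Weak negative: a random in-domain passage from a different document (a proxy for 'unrelated')."""
    gold_doc = passages[gold_pid][0]
    pool = [pid for pid, (doc, _) in passages.items() if doc != gold_doc]
    return random.choice(pool)

def strong_context_negative(gold_pid: str) -> Optional[str]:
    """Strong negative: another passage from the same document as the gold passage.
    (When a document has only one passage, ARES instead samples from the BM25 top-10; not shown.)"""
    gold_doc = passages[gold_pid][0]
    pool = [pid for pid, (doc, _) in passages.items() if doc == gold_doc and pid != gold_pid]
    return random.choice(pool) if pool else None

print(weak_context_negative("p1"))    # "p3"
print(strong_context_negative("p1"))  # "p2"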
3.2 Preparing LLM Judges
To prepare our RAG evaluation judges, we use our synthetic dataset to fine-tune several lightweight LLMs. We fine-tune our LLM judges to evaluate the RAG systems across three different capabilities, each of which is often analyzed by researchers and practitioners to gauge RAG system performance (Chen et al., 2023; James and Es, 2023):

1. Context Relevance: Is the passage returned relevant for answering the given query?

2. Answer Faithfulness: Is the answer generated faithful to the retrieved passage, or does it contain hallucinated or extrapolated statements beyond the passage?

3. Answer Relevance: Is the answer generated relevant given the query and retrieved passage?
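
For concreteness, below is a minimal sketch of fine-tuning one such judge (context relevance) as a binary cross-encoder classifier with Hugging Face Transformers. The base checkpoint, the two-example toy dataset, the hyperparameters, and the plain cross-entropy objective are our illustrative simplifications, not the exact training recipe described above:

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/deberta-v3-large"  # assumed lightweight encoder; any similar model works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Synthetic triples: label 1 = relevant passage for the query, label 0 = mined negative.
examples = {
    "query":   ["Who founded Acme Corp?", "Who founded Acme Corp?"],
    "passage": ["Acme Corp was founded in 1990 by J. Doe.", "BM25 is a ranking function."],
    "label":   [1, 0],
}

def tokenize(batch):
    # Encode query and passage as a sentence pair for the cross-encoder judge.
    return tokenizer(batch["query"], batch["passage"], truncation=True, max_length=512)

train_ds = Dataset.from_dict(examples).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="judge_context_relevance",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=train_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()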