Overview: This article introduces ARES, an automated evaluation framework for retrieval-augmented generation (RAG) systems. ARES automatically and precisely evaluates a RAG system's context relevance, answer faithfulness, and answer relevance. By training lightweight LLM judges on synthetic data and combining them with prediction-powered inference (PPI) to reduce prediction error, ARES achieves higher accuracy than existing frameworks across a variety of tasks. Experiments show that ARES substantially outperforms popular automated evaluation tools such as RAGAS on multiple metrics and remains accurate even under cross-domain shifts.
Intended audience: researchers and developers in natural language processing, and practitioners building applications on top of RAG systems.
Use cases and goals: providing an effective evaluation method for building high-quality RAG systems, helping researchers optimize model configurations and select the best designs; applicable to any application that relies on RAG, such as question answering, fact checking, and customer support.
Additional notes: the paper details ARES's workflow and technical specifics and discusses directions for future improvement; the authors release the ARES datasets and code so the research community can conduct further study and validation.
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
Jon Saad-Falcon
Stanford University
jonsaadfalcon@stanford.edu
Omar Khattab
Stanford University
okhattab@stanford.edu
Christopher Potts
Stanford University
cgpotts@stanford.edu
Matei Zaharia
UC Berkeley and Databricks
matei@databricks.com
Abstract
Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. Using synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across six different knowledge-intensive tasks in KILT and SuperGLUE, ARES accurately evaluates RAG systems while using a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our datasets and code for replication and deployment available at https://github.com/stanford-futuredata/ARES.
1 Introduction
Retrieval-augmented generation (RAG) has become a prominent approach for building user-facing NLP applications, such as systems for question answering (QA), fact-checking, and customer support (Petroni et al., 2021; Wang et al., 2019). Typically, a RAG system consists of a retriever and a downstream language model. Given a user question, the retriever finds relevant passages from a corpus (e.g., a company's internal knowledge base) and the language model uses these passages to generate a response. This formulation admits a multitude of choices: what retrieval model to use, how to divide the documents into retrieval chunks, and how to prompt or finetune the language model to use the retrieved information, to name only a few of the simplest design decisions.
The best design for a RAG system is not necessarily universal across data domains, corpus sizes, and cost/latency budgets. To tune their own RAG systems, practitioners traditionally need hand annotations for test questions, passages to retrieve (to assess the retriever), and responses to generate, labeled specifically for their target domain. Alternatively, they may evaluate different approaches in production by collecting human preferences that compare the candidate systems. Unfortunately, both of these strategies demand high expertise and impose considerable annotation costs.
Model-based evaluation has emerged as a cheap and automatic strategy for testing generative output quality (Zheng et al., 2023). For instance, the open-source RAGAS framework (James and Es, 2023) prompts a language model to evaluate the relevance of retrieved information and the faithfulness and accuracy of generated responses. Unfortunately, such strategies currently rely on a fixed set of heuristically hand-written prompts for evaluation, offering little adaptability to various evaluation contexts.
To evaluate RAG systems rapidly and accurately, we propose ARES, the Automated RAG Evaluation System. ARES is the first automated RAG evaluation system to generate tailored LLM judges for each component of a RAG pipeline, leading to substantial boosts in evaluation precision and accuracy compared to existing approaches like RAGAS. Furthermore, unlike existing RAG evaluation systems, ARES provides statistical guarantees for its predictions by leveraging prediction-powered inference (PPI), generating confidence intervals for its scoring. Given a particular corpus of documents and a RAG system, ARES reports three evaluation scores: context relevance (i.e., is the retrieved information pertinent to the test question), answer faithfulness (i.e., is the response generated by the language model properly grounded in the retrieved context), and answer relevance (i.e., is the response also relevant to the question). A good RAG system finds relevant contexts and generates answers that are both faithful and relevant.
Many existing RAG evaluation frameworks require substantial human annotations for scoring. ARES significantly improves data efficiency during evaluation by requiring only three inputs: a set of passages from the target corpus, a human preference validation set of 150 annotated datapoints or more, and five few-shot examples of in-domain queries and answers, which are used for prompting LLMs in synthetic data generation. Given the corpus of in-domain passages, ARES proceeds in three stages. First, it leverages a language model to construct a synthetic dataset of question–answer pairs, derived from the passages in the corpus. Second, ARES defines three separate judge models to perform three classification tasks (context relevance, answer faithfulness, and answer relevance). These judges are lightweight models fine-tuned against a contrastive learning objective. Third, ARES ranks the different RAG systems being assessed using prediction-powered inference (PPI; Angelopoulos et al. 2023) to improve model-based evaluation accuracy and provide statistical confidence intervals for RAG scoring. PPI utilizes a small set of human-annotated datapoints for computing its confidence intervals; we designate this annotated set as our human preference validation set, which is composed of 150 annotated datapoints or more that designate both positive and negative examples for context relevance, answer faithfulness, and answer relevance.
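
For intuition, PPI's mean-estimation form (Angelopoulos et al., 2023) can be written as follows; the notation here is ours and is only meant to illustrate how the human preference validation set enters the estimate:

\hat{\theta}_{\mathrm{PP}} = \frac{1}{N}\sum_{i=1}^{N} f(\tilde{x}_i) + \frac{1}{n}\sum_{j=1}^{n}\bigl(y_j - f(x_j)\bigr), \qquad \text{CI: } \hat{\theta}_{\mathrm{PP}} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}_f^2}{N} + \frac{\hat{\sigma}_{\Delta}^2}{n}},

where f is a fine-tuned LLM judge, \tilde{x}_1, \dots, \tilde{x}_N are the judge-scored (unlabeled) RAG outputs, (x_j, y_j) are the n (roughly 150 or more) human-annotated validation datapoints, and \hat{\sigma}_f^2 and \hat{\sigma}_{\Delta}^2 are the empirical variances of the judge predictions and of the prediction errors y_j - f(x_j). The first term is the judge's average score on the large unlabeled sample; the second term rectifies the judge's bias using its errors on the small labeled set.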
This work makes the following contributions. First, we propose the ARES framework for evaluating the context relevance, answer faithfulness, and answer relevance of RAG systems using only the corpus passages and a human preference validation set of 150 datapoints or more. Second, we offer a novel development pipeline for fine-tuning lightweight LLMs on synthetically generated queries and answers. We further bolster our lightweight LLM judges by using prediction-powered inference (PPI) and human annotations to provide statistical guarantees to ARES scoring of RAG systems. Third, we conduct extensive empirical evaluations, demonstrating that ARES accurately scores RAG systems across the six knowledge-intensive datasets in KILT and SuperGLUE, beating existing automated evaluation approaches like RAGAS by 59.29 and 14.4 percentage points on average across context relevance and answer relevance evaluation accuracy, respectively. We also find that ARES consistently distinguishes competitive RAG systems that are only a few points apart in ground-truth metrics. This precision enables ARES to guide the development and comparison of competitive approaches and configurations. We provide the datasets and code for replicating and deploying ARES on GitHub.
2 Related Work
Retrieval-augmented generation (RAG; Guu et al. 2020; Lewis et al. 2020; Khattab et al. 2021; Izacard et al. 2022) is now a common strategy for bolstering LLMs by combining them with retrieval systems. Through retrieval, RAG helps LM systems gather domain-specific knowledge, ground generations in factual information (Shuster et al., 2021; Huo et al., 2023), and offer a degree of transparency or interpretability via citing sources (Mialon et al., 2023).
LLM-based evaluation techniques have emerged for gauging LLM systems. This is essential for rapid deployment in new settings, where it is impractical to build a traditional benchmark dataset from scratch. Early attempts at this use LLMs out of the box, as in MT-Bench and Chatbot Arena (Zheng et al., 2023). AutoCalibrate (Liu et al., 2023b) seeks to align an LLM judge with human preferences, leveraging a self-refinement prompt to iteratively improve the LLM judge. Other work has used LLM prompting to evaluate system quality across natural language generation tasks, such as translation, summarization, and dialogue generation (Kocmi and Federmann, 2023; Fu et al., 2023; Liu et al., 2023a; Wang et al., 2023). In the context of knowledge-intensive NLP tasks, LLMs have been explored for assessing attribution and factuality in LLMs (Min et al., 2023; Gekhman et al., 2023; Yue et al., 2023). New guidelines like LongEval (Krishna et al., 2023) and datasets like Hagrid and ALCE (Kamalloo et al., 2023; Gao et al., 2023) provide resources for analyzing and evaluating knowledge-intensive LLM pipelines.
The two projects most closely related to our work are EXAM (Sander and Dietz, 2021) and RAGAS (James and Es, 2023). To evaluate RAG systems, the EXAM metric estimates how many exam questions a reader (simulated as a QA system) can answer correctly based on the generated response. The EXAM metric requires a set of queries with several associated sub-questions each, which adds a substantial burden that ARES does not require. RAGAS is a recent evaluation framework based on a handful of simple hand-written prompts. These heuristic prompts offer little adaptability to new RAG evaluation settings (e.g., new corpora) and, as we show in our evaluation, substantially underperform ARES.
3 ARES
The ARES evaluation framework proceeds in three stages, which we illustrate in Figure 1. ARES requires three inputs for the pipeline: a set of passages from the target corpus, a human preference validation set of 150 annotated datapoints or more, and five few-shot examples of in-domain queries and answers, which are used for prompting LLMs in synthetic data generation. With our inputs prepared, we begin by generating synthetic queries (and their answers) from the passages in the target corpus. We then use these query–passage–answer triples to train our LLM judges (e.g., for detecting answer relevance). Subsequently, we apply these judges to any RAG system, scoring a sample of its in-domain query–document–answer triples, and use prediction-powered inference (PPI) with our human preference validation set to reliably estimate a confidence interval for the quality of each RAG system.
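
To make the PPI step concrete, the following is a minimal sketch (our simplification for a single proportion-style score; the function name, toy data, and normal-approximation interval are illustrative assumptions, not the exact ARES implementation). The judge's scores on the large unlabeled sample are debiased by its average error on the small human-labeled validation set, and the variance of both terms feeds the confidence interval:

import numpy as np
from scipy import stats

def ppi_confidence_interval(judge_unlabeled, judge_labeled, human_labels, alpha=0.05):
    """Prediction-powered estimate of a mean score with a (1 - alpha) confidence interval.

    judge_unlabeled: judge predictions (0/1 or probabilities) on the large unlabeled sample.
    judge_labeled:   judge predictions on the small human-annotated validation set.
    human_labels:    human 0/1 annotations aligned with judge_labeled.
    """
    judge_unlabeled = np.asarray(judge_unlabeled, dtype=float)
    errors = np.asarray(human_labels, dtype=float) - np.asarray(judge_labeled, dtype=float)

    # Rectified point estimate: judge mean on unlabeled data, debiased by mean error on labeled data.
    theta_pp = judge_unlabeled.mean() + errors.mean()

    # Normal-approximation confidence interval combining both sources of variance.
    var = judge_unlabeled.var(ddof=1) / len(judge_unlabeled) + errors.var(ddof=1) / len(errors)
    half_width = stats.norm.ppf(1 - alpha / 2) * np.sqrt(var)
    return theta_pp, (theta_pp - half_width, theta_pp + half_width)

# Toy usage: 2,000 judge-scored RAG outputs, 150 human-labeled validation datapoints.
rng = np.random.default_rng(0)
unlabeled_preds = rng.binomial(1, 0.72, size=2000)   # judge marks ~72% of contexts as relevant
labeled_preds = rng.binomial(1, 0.72, size=150)
labels = np.where(rng.random(150) < 0.05, 1 - labeled_preds, labeled_preds)  # judge wrong ~5% of the time
point, (lo, hi) = ppi_confidence_interval(unlabeled_preds, labeled_preds, labels)
print(f"context relevance estimate {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")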
3.1 LLM Generation of Synthetic Dataset
We generate synthetic queries and answers from the corpus passages using generative LLMs. The generated data represent both positive and negative examples of query–passage–answer triples (e.g., relevant/irrelevant passages and correct/incorrect answers). For generation, the LLM uses our input set of few-shot examples with in-domain passages mapped to in-domain queries and answers; the model then generates a synthetic question and answer from a given in-domain passage, allowing us to create both positive and negative training examples. We include example prompts for generating synthetic queries and answers in A.5.
For creating our synthetic data, we primarily rely on FLAN-T5 XXL (discussed in subsection 4.1). ARES works well with this model (see section 5), but our system can ultimately use another high-quality model for generating synthetic queries and answers. To filter out low-quality queries, we verify that a given query can retrieve its original passage as the top result using its retriever system. This filtering approach has been used in previous work to isolate high-quality synthetic queries (Dai et al., 2022; Saad-Falcon et al., 2023).
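
As an illustration of this filtering step, here is a minimal sketch that uses the rank_bm25 package as a stand-in retriever (an assumption for illustration; in practice the retriever of the evaluated RAG system plays this role). A synthetic query is kept only if it ranks its source passage first:

from rank_bm25 import BM25Okapi

# Toy passage collection standing in for the target corpus.
corpus = [
    "ARES evaluates RAG systems with lightweight LM judges.",
    "BM25 is a classical lexical ranking function.",
    "FLAN-T5 is an instruction-tuned language model.",
]
tokenized_corpus = [passage.lower().split() for passage in corpus]
bm25 = BM25Okapi(tokenized_corpus)

def keep_synthetic_query(query: str, source_idx: int) -> bool:
    """Keep the synthetic query only if its source passage is retrieved at rank 1."""
    scores = bm25.get_scores(query.lower().split())
    return int(scores.argmax()) == source_idx

print(keep_synthetic_query("classical lexical ranking function", source_idx=1))  # True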
To generate negatives for fine-tuning our LLM judges, we rely on two novel strategies, generating the same number of negatives with each strategy:

1. Weak Negative Generation: For context relevance negatives, we randomly sample in-domain passages unrelated to a given synthetic query. For answer faithfulness and answer relevance negatives, we randomly sample synthetically-generated answers from other passages, which were created using FLAN-T5 XXL.

2. Strong Negative Generation: For context relevance negatives, we randomly sample in-domain passages from the same document as the gold passage. For datasets in which multiple passages are not available for the same document, we use BM25 to retrieve the top-10 passages similar to the passage and sample from them for our context relevance strong negatives. For answer faithfulness and answer relevance negatives, we prompt FLAN-T5 XXL (discussed in Section 4.1) to generate a contradictory answer using the few-shot prompt in Section A.4.

In total, the number of negatives generated equals the number of positives generated for evaluating context relevance and answer relevance in RAG systems.
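
A simplified sketch of the two context-relevance mining strategies above (all data structures and the "different document" proxy for "unrelated to the query" are our illustrative placeholders; answer-level negatives via sampled or contradictory answers are omitted):

import random
from typing import Optional

random.seed(0)

# Toy in-domain corpus: passage id -> (document id, text). Illustrative data only.
passages = {
    "p1": ("doc_a", "ARES fine-tunes lightweight LM judges on synthetic data."),
    "p2": ("doc_a", "PPI corrects judge predictions with a small labeled set."),
    "p3": ("doc_b", "BM25 ranks passages by lexical overlap with the query."),
}

def weak_context_negative(gold_pid: str) -> str:
    """Weak negative: a random in-domain passage from a different document (a proxy for 'unrelated')."""
    gold_doc = passages[gold_pid][0]
    pool = [pid for pid, (doc, _) in passages.items() if doc != gold_doc]
    return random.choice(pool)

def strong_context_negative(gold_pid: str) -> Optional[str]:
    """Strong negative: another passage from the same document as the gold passage.
    (When a document has only one passage, ARES instead samples from the BM25 top-10; not shown.)"""
    gold_doc = passages[gold_pid][0]
    pool = [pid for pid, (doc, _) in passages.items() if doc == gold_doc and pid != gold_pid]
    return random.choice(pool) if pool else None

print(weak_context_negative("p1"))    # "p3"
print(strong_context_negative("p1"))  # "p2"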
3.2 Preparing LLM Judges
To prepare our RAG evaluation judges, we use our synthetic dataset to fine-tune several lightweight LLMs. We fine-tune our LLM judges to evaluate the RAG systems across three different capabilities, each of which is often analyzed by researchers and practitioners to gauge RAG system performance (Chen et al., 2023; James and Es, 2023):

1. Context Relevance: Is the passage returned relevant for answering the given query?

2. Answer Faithfulness: Is the answer generated faithful to the retrieved passage, or does it contain hallucinated or extrapolated statements beyond the passage?

3. Answer Relevance: Is the answer generated relevant given the query and retrieved passage?
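
For concreteness, below is a minimal sketch of fine-tuning one such judge (context relevance) as a binary cross-encoder classifier with Hugging Face Transformers. The base checkpoint, the two-example toy dataset, the hyperparameters, and the plain cross-entropy objective are our illustrative simplifications, not the exact training recipe described above:

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/deberta-v3-large"  # assumed lightweight encoder; any similar model works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Synthetic triples: label 1 = relevant passage for the query, label 0 = mined negative.
examples = {
    "query":   ["Who founded Acme Corp?", "Who founded Acme Corp?"],
    "passage": ["Acme Corp was founded in 1990 by J. Doe.", "BM25 is a ranking function."],
    "label":   [1, 0],
}

def tokenize(batch):
    # Encode query and passage as a sentence pair for the cross-encoder judge.
    return tokenizer(batch["query"], batch["passage"], truncation=True, max_length=512)

train_ds = Dataset.from_dict(examples).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="judge_context_relevance",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=train_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()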