Towards Trustworthy Large Language Models
in Industry Domains
INF-Team
July 04, 2024
Abstract
This report addresses the challenges and strategies for mitigating hallucinations in large language models (LLMs), particularly in domain-specific applications. Hallucinations refer to the generation of unrealistic or illogical outputs by LLMs. We explore several methods to reduce hallucinations, including using high-quality domain-specific data for training, ensuring that the model stays up to date with new knowledge, and employing alignment techniques to ensure that the LLM adheres to human instructions. A key proposition is the adoption of neuro-symbolic systems, which combine large-scale deep learning models with symbolic AI. These systems leverage neural networks for fast “black box” probabilistic predictions while also enabling “white box” logical reasoning. The integration of these approaches represents a significant technical direction for future artificial general intelligence and provides a “gray box” approach to developing trustworthy LLMs for industrial applications. This dual capability enhances logical reasoning and improves explainability. In addition, we detail our efforts to construct domain-specific LLMs for finance and healthcare. Using anti-hallucination strategies, our finance LLM outperforms GPT-4 on the CFA tests, while our healthcare LLM ranks first on the public MedBench competition leaderboard.
1 Introduction
With the unprecedented development of large language models (LLMs), these models are now used to improve communication, generate creative text formats, translate languages effectively, and even assist scientific research. However, LLMs are notorious for yielding unreliable outputs, which greatly hinders their application in real-world tasks, especially high-stakes decision-making applications in industries such as healthcare, asset investment, criminal justice, and other domains.
One of the key challenges is the “hallucination” problem, whereby LLMs may output content that seems reasonable but is, in fact, incorrect or illogical. As an inherent limitation of LLMs [51], the hallucination phenomenon is inevitable. Hallucinations can be divided into two categories [13, 14]. One is factuality hallucination, where the generated content is inconsistent with the facts of the real world and contains unexpected fictitious concepts and plots. The other is faithfulness hallucination, where the generated content is inconsistent with the input instructions and logic. Both categories pose serious obstacles to ensuring the accuracy and reliability of model outputs in industry applications.
Domain-specific LLMs focus on understanding and responding to a particular field or industry, e.g., finance or healthcare, aiming to resolve domain-specific tasks as highly trained professionals would. For real-world industries where domain-specific LLMs may play crucial roles in decision-making, trustworthiness becomes even more demanding. Therefore, domain-specific LLMs must address the major limitation discussed above, i.e., hallucination.
Although the massive training data of LLMs covers a wide range of topics, a considerable portion of real-world domain knowledge is long-tail, and the scarcity of domain data may contribute to hallucinations. In specific industry domains, general LLMs may lack professional knowledge and fail to follow technical instructions due to insufficient domain data at the training stage.
To mitigate factuality hallucination, an effective approach is to address the issue of data scarcity in training. Feasible solutions include curating high-quality factual data specifically for the domain, developing automatic data cleaning and selection techniques, and designing execution engines for high-quality synthetic data. Massive domain-specific data can greatly reduce the model’s tendency to fabricate information after training.
Continuous training with high-quality domain-specific data helps LLMs acquire domain knowledge to understand technical nuances in context and instructions, thereby alleviating faithfulness hallucinations as well. To further reduce faithfulness hallucination and improve productivity, we resort to alignment techniques to ensure that LLMs actively cooperate with professional instructions to achieve specific goals. Meanwhile, we design reward systems that motivate LLMs to behave in a way that is consistent with human values, and employ reinforcement learning to learn from preference feedback in a human-in-the-loop system that provides safety guidance and supervision.
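To make the preference-learning component concrete, below is a minimal sketch of the standard Bradley-Terry pairwise loss commonly used to train reward models from human preference feedback; the `RewardModel` head, hidden sizes, and batch shapes are illustrative assumptions rather than our production design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Illustrative reward head: maps a pooled LLM hidden state to a scalar score."""
    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
        # pooled_hidden: (batch, hidden_size) final-token representation
        return self.score(pooled_hidden).squeeze(-1)  # (batch,)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the human-preferred response
    to score higher than the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with random features standing in for pooled LLM hidden states.
model = RewardModel(hidden_size=16)
h_chosen, h_rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = preference_loss(model(h_chosen), model(h_rejected))
loss.backward()
```

In a full human-in-the-loop pipeline, a scalar reward trained this way would then guide a policy-optimization step such as PPO.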
As black-box models, LLMs cannot always explain their outputs correctly. The presence of hallucinations is an intrinsic obstacle that prevents them from consistently yielding reliable explanations, even when prompted to explain step by step. Research on the interpretability of LLMs is still developing. For example, dictionary learning from Anthropic is a tool for understanding the correspondence between model components and particular inputs [3]. The method has improved our analytic capacity to break down the complexity of LLMs into more understandable features. However, exploring all the internal features learned by LLMs during training is still cost-prohibitive, and effectively manipulating specific features to obtain predictably superior behavior remains difficult. Lack of explainability is another challenge for LLMs in gaining trust in high-stakes applications where a transparent decision process is critical.
To improve the explainability of LLM behavior, we can either dive into attention mechanisms with tools such as dictionary learning to analyze the generation process, or prompt LLMs to carry out reflection and verification step by step and assign confidence scores to their responses [7]. It is also possible to produce counterfactual explanations by providing alternative scenarios in which the output of the LLM would change, offering insights into its reasoning process. While these methods allow users to assess the reliability of the information, the interpretations are contaminated by the intrinsic hallucinations within LLMs. To break through these innate limitations, we introduce neuro-symbolic systems to assist LLMs in gaining explainability and transparency in content generation.
Neuro-symbolic systems are an emerging direction in AI that aims to combine the strengths of two different AI techniques, i.e., deep learning and symbolic AI. LLMs are very successful exemplars of deep learning, which excels at learning from vast amounts of data and generating creative text formats but struggles with tasks requiring reasoning, logic, and explainability. Symbolic AI uses symbols and rules to represent knowledge and excels in logical reasoning and explainability; however, it can be less efficient at learning from data. To bridge this gap by integrating both approaches, we leverage the reading comprehension capabilities of LLMs to process raw data and generate an initial understanding. The preliminary instances are then passed to our in-house symbolic reasoning engine, which performs reasoning on domain-specific causal graphs to make decisions. The final output may combine multiple interactive results from both modules. The in-house symbolic reasoning engine can visualize all feasible reasoning paths, offering more logical and explainable outputs. We regard this proposal as a unique “gray box” approach to trustworthy LLMs in industry domains.
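Since the in-house reasoning engine is not public, the following toy sketch only illustrates the division of labor described above: a neural step extracts structured facts from raw text (a trivial keyword spotter stands in for the LLM here), and a symbolic step enumerates explicit reasoning paths over a hand-written causal graph. The graph, node names, and rules are hypothetical stand-ins.

```python
# Domain-specific causal graph (assumed acyclic): node -> list of (effect, rule).
CAUSAL_GRAPH = {
    "revenue_drop": [("profit_warning", "falling revenue pressures profit")],
    "profit_warning": [("credit_risk_up", "profit warnings raise credit risk")],
}

def extract_facts(text: str) -> set[str]:
    """Stand-in for the neural step: in practice an LLM reads the raw
    document and emits structured facts; here, simple keyword spotting."""
    facts = set()
    if "revenue fell" in text.lower():
        facts.add("revenue_drop")
    return facts

def reason(facts: set[str]) -> list[list[str]]:
    """Symbolic step: enumerate every reasoning path reachable from the
    extracted facts, so each conclusion carries an explicit trace."""
    paths, frontier = [], [[f] for f in facts]
    while frontier:
        path = frontier.pop()
        successors = CAUSAL_GRAPH.get(path[-1], [])
        if not successors:
            paths.append(path)  # terminal node: a complete reasoning path
        for effect, _rule in successors:
            frontier.append(path + [effect])
    return paths

for path in reason(extract_facts("Q3 report: revenue fell 12% year over year.")):
    print(" -> ".join(path))  # revenue_drop -> profit_warning -> credit_risk_up
```

Because every conclusion is returned together with the chain of graph edges that produced it, the paths themselves serve as the visualizable, “white box” explanation.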
The rest of this report is organized as follows. In Section 2, we detail our approach to implementing trustworthy domain-specific LLMs, including high-quality data collection, alignment techniques, and neuro-symbolic computation. In Sections 3 and 4, we introduce two domain-specific LLMs, for healthcare and finance, respectively. Our approach to trustworthy domain-specific LLMs is compatible with any open-source foundation model. To demonstrate its feasibility, we use an in-house 34B foundation model to develop our healthcare LLM, and choose the open-source Qwen2-72B base model for continuous training and instruction alignment to build our finance LLM. In Section 5, we conclude and discuss some directions for future work.
2 Methodology Overview
In this section, we present our methodology for constructing trustworthy LLMs in industry domains, which consists of three parts: high-quality data preparation, alignment techniques, and neuro-symbolic computing techniques.
2.1 High-quality Data Preparation
2.1.1 General Data Processing Pipeline
High-quality training data are essential for effective large language model (LLM) training. To achieve this, we have amassed a substantial dataset. The primary sources of text data include Common Crawl, Wikipedia, books, academic papers, journals, patents, news articles, and K-12 educational resources. For code data, the main sources are GitHub and Stack Overflow. Following data collection, we perform data cleansing, which involves three main steps: filtering, deduplication, and selection. The overall process is illustrated in Fig. 1.
Filtering: For the filtering process, we employ heuristic rules to filter the text, which helps avoid selection bias. These heuristic rules allow us to eliminate low-quality data effectively, and different rules are applied to different types of text. The filtering primarily focuses on removing repetitive text using n-gram repetition detection and sentence-level detection. Additionally, we created a list of sensitive words and removed any documents that contained those words or personally identifiable information.
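As a concrete illustration of one such heuristic, the sketch below flags documents whose n-gram repetition ratio is too high or that contain sensitive words; the n-gram size, threshold, and substring-based word matching are illustrative assumptions, not our production rules.

```python
from collections import Counter

def repeated_ngram_ratio(text: str, n: int = 5) -> float:
    """Fraction of word n-grams that are repeats; a high value signals the
    boilerplate-like repetition the heuristic filters target."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

def passes_filters(text: str, banned_words: set[str],
                   max_repeat_ratio: float = 0.2) -> bool:
    """One illustrative rule set: drop documents with heavy n-gram
    repetition or any sensitive word (thresholds are assumptions)."""
    if repeated_ngram_ratio(text) > max_repeat_ratio:
        return False
    return not any(w in text for w in banned_words)
```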
Deduplication: Deduplication includes fuzzy deduplication and exact deduplication [15]. For fuzzy deduplication, we employ MinHash-LSH for approximate deduplication. The process involves several steps: (1) standardize the text, split it into sequences using text segmentation, and apply n-gram processing to the sequences; (2) compute MinHash values and compress them into a set of bucketed hash values using Locality-Sensitive Hashing (LSH); (3) perform approximate deduplication using the hashes. Then, we utilize a suffix array algorithm for exact deduplication. This method includes: (1) dividing files according to memory limitations; (2) loading them into memory to compute the suffix array; (3) identifying duplicate intervals and deleting any entire document exceeding a predefined duplication threshold (to maintain text integrity). This step requires substantial memory and is therefore performed last.
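The fuzzy stage can be sketched with the open-source datasketch library, which provides MinHash and LSH implementations; the shingle size, similarity threshold, and toy corpus below are assumptions, and the suffix-array exact pass is omitted.

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash_of(text: str, n: int = 5, num_perm: int = 128) -> MinHash:
    """Standardize the text, shingle it into character n-grams, and hash."""
    text = " ".join(text.lower().split())  # crude standardization
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - n + 1, 1)):
        m.update(text[i:i + n].encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # threshold is an assumption
docs = {
    "d1": "The quick brown fox jumps over the lazy dog.",
    "d2": "the quick brown fox jumps over the lazy dog",   # near-duplicate of d1
    "d3": "Completely unrelated text about suffix arrays.",
}

kept = []
for key, text in docs.items():
    mh = minhash_of(text)
    if lsh.query(mh):   # a similar document was already indexed: drop this one
        continue
    lsh.insert(key, mh)
    kept.append(key)
print(kept)             # expected: ['d1', 'd3']
```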
[Figure 1 omitted; stages shown: Source Document, Quality Signal Computation, Filtering, Fuzzy Deduplication, Exact Deduplication.]
Figure 1: The primary workflow of our data-cleaning process. The purple sections indicate the filtering stages, while the yellow sections represent the deduplication process.
2.1.2 Recalling High-quality Data from Common Crawl
The Common Crawl (CC) dataset contains an immense collection of web pages.
Traditionally, our approach to processing CC data has been limited to filtering
and deduplication, preventing more advanced analysis of this extensive dataset.
However, to identify reliable and high-quality data across various fields, we need a more sophisticated processing method. We propose a fine-grained division strategy for CC data. Initially, we segment the URLs in all snapshots of the CC dataset by base URL (e.g., www.google.com is considered a base URL). We count the occurrences of each base URL and then rank them in descending order. Our findings indicate that the top 2 million base URLs account for approximately 65% of the entire CC dataset. Therefore, we believe that annotating these base URLs with their type, topic, and language will provide valuable information. This method allows for a preliminary fine-grained segmentation of CC data, although it may introduce some inaccuracies.
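The segmentation step amounts to a large group-by over URLs; a minimal sketch, with a toy URL list standing in for the CC snapshots, is:

```python
from collections import Counter
from urllib.parse import urlparse

def base_url(url: str) -> str:
    """Reduce a full URL to its base URL, e.g.
    'https://www.google.com/search?q=x' -> 'www.google.com'."""
    return urlparse(url).netloc.lower()

# Toy stand-in for the URL stream of all CC snapshots.
urls = [
    "https://www.google.com/a", "https://www.google.com/b",
    "https://arxiv.org/abs/2301.00001", "http://example.org/",
]

counts = Counter(base_url(u) for u in urls)
top = counts.most_common(2_000_000)  # in practice: the top-2M base URLs
coverage = sum(c for _, c in top) / sum(counts.values())
print(top[:3], f"coverage={coverage:.0%}")
```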
High-quality data plays a crucial role in the capabilities of general models [54, 55], but related corpora are extremely scarce. Therefore, we employ a method similar to the data-recalling mechanism in DeepSeek-Math [34]. This method recalls high-quality data from Common Crawl (CC), focusing on three domains: code, math, and Wiki. The code and math data enhance the model’s reasoning capabilities, while the Wiki data enriches the model’s knowledge. This process includes seed acquisition, URL aggregation, and fastText-based recall, as shown in Fig. 2.
Seed collection: For the mathematical and code data, we choose OpenWebMath [27], Stack Overflow pages, and Wikipedia pages as our initial English seeds. Publicly available Chinese training datasets for mathematics and code are limited, so, following AutoMathText [59], we prompt a base LLM to autonomously annotate data for topic relevance and educational value, subsequently retrieving the top 50K entries as our initial Chinese seeds. For knowledge, we employ an LLM to annotate Wikipedia data with educational scores and subsequently train a classifier; we then collect 50,000 high-quality seeds from wiki sources such as Wikipedia. Finally, we train a fastText model using the collected seed data as positive samples and random CC documents as negative samples.
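A hedged sketch of the final step, using the official fasttext Python package; the training-file name, label scheme, and hyperparameters are assumptions:

```python
import fasttext  # pip install fasttext

# "recall_train.txt" (assumed name) holds one example per line in the
# standard fastText supervised format:
#   __label__seed     <a high-quality seed document>
#   __label__negative <a random Common Crawl document>
model = fasttext.train_supervised(
    input="recall_train.txt", epoch=5, lr=0.5, wordNgrams=2, dim=100,
)

def recall_score(document: str) -> float:
    """Probability that a CC document belongs to the seed (target) class."""
    labels, probs = model.predict(document.replace("\n", " "))
    p = float(probs[0])
    return p if labels[0] == "__label__seed" else 1.0 - p

# Keep a page if its score clears a recall threshold (0.5 is an assumption).
if recall_score("Theorem: every prime p > 2 is odd. Proof: ...") > 0.5:
    pass  # add the page to the recalled set
```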
URL aggregation: Due to the insufficient diversity of the seeds, much of the target data remains uncollected after the first round of recall. We further enhance the diversity of the seeds through URL aggregation: we manually annotate sub-URLs (e.g., cloud.tencent.com/developer) from domains where over 10% of the pages are hit, and incorporate the uncollected samples into the seed set.
Iterative recall: After collecting more diverse seeds, we retrain the fastText model and recall more target web pages. We repeat this process of URL aggregation and fastText retraining until over 98% of the recall results have been collected.
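Only the control flow of this loop is fixed; in the sketch below every individual step (retraining, recall, aggregation, annotation) is a stub, so it runs end to end on toy data but should be read as a schematic, not an implementation.

```python
def train_classifier(seeds: set[str]):
    """Stub for fastText retraining: the 'model' is substring match on any seed."""
    return lambda doc: any(s in doc for s in seeds)

def recall_pages(model, corpus: list[str]) -> set[str]:
    """Stub for fastText-based recall over the corpus."""
    return {doc for doc in corpus if model(doc)}

def aggregate_new_seeds(hits: set[str]) -> set[str]:
    """Stub for URL aggregation + manual annotation of frequently hit domains."""
    return {doc.split()[0] for doc in hits}

def iterative_recall(seeds: set[str], corpus: list[str],
                     target: float = 0.98) -> set[str]:
    seeds = set(seeds)  # avoid mutating the caller's seed set
    collected: set[str] = set()
    while True:
        hits = recall_pages(train_classifier(seeds), corpus)
        # Stop once >= target of the recall results were already collected.
        if not hits or len(hits & collected) / len(hits) >= target:
            return collected | hits
        collected |= hits
        seeds |= aggregate_new_seeds(hits)  # diversify the seed set

corpus = ["math proof of a lemma", "math olympiad notes", "cat pictures"]
print(iterative_recall({"proof"}, corpus))
```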
Figure 2: An illustration of the high-quality data recall process.