Cost-Driven Active Learning with Semi-Supervised
Cluster Tree for Text Classification
Zhaocai Sun², Yunming Ye², Yan Li¹, Shengchun Deng³, and Xiaolin Du²

¹ Shenzhen Graduate School, Harbin Institute of Technology
² School of Computer Engineering, Shenzhen Polytechnic
³ Department of Computer Science, Harbin Institute of Technology

zhcsun@hotmail.com, yeyunming@hit.edu.cn, liyan@szpt.edu.cn,
dengshengchun@hit.edu.cn, duxiaolin@gmail.com
Abstract. The key idea of active learning is that a machine learner can perform
better with less data, or at lower cost, if it is allowed to choose its training
data actively. However, the relation between labeling cost and model performance
has seldom been studied in the literature. In this paper, we study this problem
thoroughly and propose a criterion, called cost-performance, to balance this
relation. Based on this criterion, a cost-driven active SSC algorithm is proposed,
which can stop the active learning process automatically. Empirical results show
that our method outperforms active SVM and co-EMT.
Keywords: Active Learning, Semi-supervised Learning, Cluster Tree.
1 Introduction
In intelligent information processing and data mining, achieving higher performance
with less training data is always desired [11,20,17]. However, it has been
empirically shown that, in most cases, the performance of a classifier is positively
correlated with the number of training samples. Active learning was proposed to
address this contradictory but meaningful problem [9,11]. Because it is trained on
the most informative data, an active classifier can achieve high performance even
though it uses less training data than ordinary classifiers. This well-known
phenomenon is called "less is more" [9].
In general, the procedure of an active learner is summarized as follows [11,17]:
1. Only the most uncertain data (e.g., those near the class margin) are sampled
from the unlabeled data set;
2. The sampled data are labeled by the "oracle" and added to the labeled data set;
3. The active classifier is retrained on the new, enlarged labeled data set.
By repeating the above three steps T times, the active classifier becomes stronger
and stronger. Note that, in most active algorithms, T must be pre-set (e.g.,
T = 50). That is, a drawback of current active algorithms is that there is no
mechanism to stop the active process automatically. In other words, there is no
criterion for comparing two classifiers during active learning: although the
posterior classifier generally performs better than the prior one, it uses more
labeled data for training and thus incurs a higher cost. A criterion is therefore
needed to balance the performance and the cost of the active classifier.
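The three-step loop above can be sketched in a few lines of Python. This is a
minimal, hypothetical illustration (not the paper's SSC algorithm): it uses a
toy 1-D nearest-centroid classifier and margin-based uncertainty sampling, and
the names `train_centroids`, `uncertainty`, and `active_learn` are invented for
this sketch. The loop runs for a fixed T (here `rounds`), which is exactly the
pre-set stopping behavior the text criticizes.

```python
import random

def train_centroids(labeled):
    """Step 3 (retraining): compute one centroid per class from (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def uncertainty(x, centroids):
    """Step 1 (sampling score): a small gap between the two nearest
    centroids means the point lies near the class margin."""
    d = sorted(abs(x - c) for c in centroids.values())
    return -(d[1] - d[0])  # higher value = more uncertain

def active_learn(labeled, unlabeled, oracle, rounds=5, batch=2):
    """Generic pool-based active-learning loop with a pre-set number
    of rounds T (`rounds`), as in the procedure described above."""
    for _ in range(rounds):
        centroids = train_centroids(labeled)
        # Step 1: pick the most uncertain points from the unlabeled pool.
        unlabeled.sort(key=lambda x: uncertainty(x, centroids), reverse=True)
        queries, unlabeled = unlabeled[:batch], unlabeled[batch:]
        # Step 2: the oracle labels them; add them to the labeled set.
        labeled += [(x, oracle(x)) for x in queries]
    # Step 3 (final retraining) on the enlarged labeled set.
    return train_centroids(labeled)

# Toy 1-D problem: class 0 lies near 0.0, class 1 near 1.0.
oracle = lambda x: int(x > 0.5)
random.seed(0)
pool = [random.random() for _ in range(100)]
model = active_learn([(0.1, 0), (0.9, 1)], pool, oracle)
pred = lambda x: min(model, key=lambda y: abs(x - model[y]))
print(pred(0.2), pred(0.8))
```

Note that the queried points cluster near the decision boundary at 0.5, which is
precisely the "most uncertain data near the class margin" of step 1; a
cost-performance criterion, as proposed in this paper, would replace the fixed
`rounds` parameter with an automatic stopping test.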
J. Sobecki, V. Boonjing, and S. Chittayasothorn (eds.), Advanced Approaches to Intelligent
Information and Database Systems, Studies in Computational Intelligence 551,
DOI: 10.1007/978-3-319-05503-9_5,
© Springer International Publishing Switzerland 2014