Cost-Driven Active Learning with Semi-Supervised
Cluster Tree for Text Classification
Zhaocai Sun², Yunming Ye², Yan Li¹, Shengchun Deng³, and Xiaolin Du²

¹ Shenzhen Graduate School, Harbin Institute of Technology
² School of Computer Engineering, Shenzhen Polytechnic
³ Department of Computer Science, Harbin Institute of Technology

zhcsun@hotmail.com, yeyunming@hit.edu.cn, liyan@szpt.edu.cn,
dengshengchun@hit.edu.cn, duxiaolin@gmail.com
Abstract. The key idea of active learning is that a machine learner can perform
better with less data, or at lower cost, if it is allowed to choose its training
data actively. However, the relation between labeling cost and model performance
has seldom been studied in the literature. In this paper, we study this problem
thoroughly and propose a criterion, called cost-performance, to balance this
relation. Based on this criterion, a cost-driven active SSC algorithm is proposed,
which can stop the active learning process automatically. Empirical results show
that our method outperforms active SVM and co-EMT.
Keywords: Active Learning, Semi-supervised Learning, Cluster Tree.
1 Introduction
In intelligent information processing and data mining, achieving higher performance
with less training data is always desired [11,20,17]. However, it has been
empirically shown that, in most cases, the performance of a classifier is positively
correlated with the number of training samples. Active learning was proposed to
address this contradictory but meaningful problem [9,11]. Because it is trained on
the most informative data, an active classifier can achieve high performance even
though it uses less training data than ordinary classifiers. This well-known
phenomenon is called "less is more" [9].
In general, the procedure of an active learner is summarized as follows [11,17]:
1. Only the most uncertain data (e.g., those near the class margin) are sampled
from the unlabeled data set;
2. The sampled data are labeled by the "oracle" and added to the labeled data set;
3. The active classifier is retrained on the new, enlarged labeled data set.
By repeating the above three steps T times, the active classifier becomes stronger
and stronger. Note that, in most active algorithms, T must be pre-set (e.g.,
T = 50). That is, a drawback of current active algorithms is that there is no
mechanism to stop the active process automatically. In other words, there is no
criterion for comparing two classifiers during active learning: although the
posterior classifier generally performs better than the prior one, it uses more
labeled data for training and thus incurs a higher cost. A criterion is therefore
needed to balance the performance and the cost of the active classifier.
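The three-step loop above can be sketched in a few lines of Python. This is a
minimal, hypothetical illustration (not the paper's SSC algorithm): it uses a
toy 1-D nearest-centroid classifier and margin-based uncertainty sampling, and
the names `train_centroids`, `uncertainty`, and `active_learn` are invented for
this sketch. The loop runs for a fixed T (here `rounds`), which is exactly the
pre-set stopping behavior the text criticizes.

```python
import random

def train_centroids(labeled):
    """Step 3 (retraining): compute one centroid per class from (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def uncertainty(x, centroids):
    """Step 1 (sampling score): a small gap between the two nearest
    centroids means the point lies near the class margin."""
    d = sorted(abs(x - c) for c in centroids.values())
    return -(d[1] - d[0])  # higher value = more uncertain

def active_learn(labeled, unlabeled, oracle, rounds=5, batch=2):
    """Generic pool-based active-learning loop with a pre-set number
    of rounds T (`rounds`), as in the procedure described above."""
    for _ in range(rounds):
        centroids = train_centroids(labeled)
        # Step 1: pick the most uncertain points from the unlabeled pool.
        unlabeled.sort(key=lambda x: uncertainty(x, centroids), reverse=True)
        queries, unlabeled = unlabeled[:batch], unlabeled[batch:]
        # Step 2: the oracle labels them; add them to the labeled set.
        labeled += [(x, oracle(x)) for x in queries]
    # Step 3 (final retraining) on the enlarged labeled set.
    return train_centroids(labeled)

# Toy 1-D problem: class 0 lies near 0.0, class 1 near 1.0.
oracle = lambda x: int(x > 0.5)
random.seed(0)
pool = [random.random() for _ in range(100)]
model = active_learn([(0.1, 0), (0.9, 1)], pool, oracle)
pred = lambda x: min(model, key=lambda y: abs(x - model[y]))
print(pred(0.2), pred(0.8))
```

Note that the queried points cluster near the decision boundary at 0.5, which is
precisely the "most uncertain data near the class margin" of step 1; a
cost-performance criterion, as proposed in this paper, would replace the fixed
`rounds` parameter with an automatic stopping test.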
J. Sobecki, V. Boonjing, and S. Chittayasothorn (eds.), Advanced Approaches to Intelligent
Information and Database Systems, Studies in Computational Intelligence 551,
DOI: 10.1007/978-3-319-05503-9_5,
© Springer International Publishing Switzerland 2014