Effectively Classifying Short Texts via Improved Lexical Category and Semantic Features
Huifang Ma(✉), Runan Zhou, Fang Liu, and Xiaoyong Lu
College of Computer Science and Engineering,
Northwest Normal University, Lanzhou, China
mahuifang@yeah.net
Abstract. Classification of short text is challenging due to its severe sparseness
and high dimension, which are typical characteristics of short text. In this paper,
we propose a novel approach to classify short texts based on both lexical and
semantic features. Firstly, the term dictionary is constructed by selecting lexical
features that are most representative words of a certain category, and then the
optimal topic distribution from the background knowledge repository is
extracted via Latent Dirichlet Allocation. The new feature for short text is
thereafter constructed. The experimental results show that our method achieved
significant quality enhancement in terms of short text classification.
Keywords: Short text classification · Latent Dirichlet allocation · Lexical features · Semantic features · Optimal topic distribution
1 Introduction
With the rapid development of the social network, we are now dealing with much more
short texts in various applications. Examples are snippets in search results, tweets,
status updates, news comments, and reviews from various social platforms. There is an
urgent demand to interpret these short texts efficiently and effectively. However, since short
texts do not provide sufficient word occurrences, traditional methods [1] such as
“Bag-Of-Words” fail to represent these short texts accurately due to their
high dimension and severe sparsity. One crucial question, for conventional classification
methods like support vector machines (SVM) [2] and k-nearest neighbors (k-NN) [3], is
how to select representative features in such short documents. The characteristics of
short text will hinder the application of conventional machine learning and text mining
algorithms.
Most existing approaches have attempted to enrich short texts to obtain more features,
combining additional semantics, contexts, and associations. Existing work in the
literature tried to address the aforementioned challenges from two directions. One was to
employ search engines and utilize the search results to expand related contextual
content [4, 5], while the other was to utilize external repositories as background
knowledge [6, 7]. These two types of methods can partly solve the problem, but
still leave much room for improvement. Both methods tried to obtain more semantic
© Springer International Publishing Switzerland 2016
D.-S. Huang et al. (Eds.): ICIC 2016, Part I, LNCS 9771, pp. 163–174, 2016.
DOI: 10.1007/978-3-319-42291-6_16
information by retrieving terms from search engines, Wikipedia, or other resources;
however, this not only introduces irrelevant information but also consumes much
more time.
In the opposite direction of enriching short text representation, some researchers [8]
proposed to trim a short text representation down to a few of the most representative words for
short text classification. Others attempted to represent short texts via effective feature
selection methods that reduce the dimension of the feature space. Short texts can then be
represented by these selected features. Several feature selection measures have been put
forward to reduce dimensionality in the past years, such as term frequency-inverse
document frequency (TF-IDF), information gain (IG), mutual information (MI) and
expected cross entropy (ECE), etc. [9].
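As a rough illustration of one of these measures, the sketch below computes information gain for candidate terms over a tiny hand-made labeled corpus; the corpus, the tokenization, and the use of base-2 entropy are assumptions of the example, not details taken from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a label distribution."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(term, docs, labels):
    """IG(t) = H(C) - H(C | t), where the condition is the term's
    presence or absence in each document of a labeled corpus."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = sum(len(part) / n * entropy(part)
               for part in (with_t, without_t) if part)
    return entropy(labels) - cond

# Toy corpus: two "sports" titles and two "finance" titles.
docs = [["goal", "match"], ["match", "score"],
        ["stock", "market"], ["market", "price"]]
labels = ["sports", "sports", "finance", "finance"]
print(information_gain("match", docs, labels))  # 1.0: perfectly separates the classes
print(information_gain("score", docs, labels))  # lower: appears in only one document
```

Terms with high information gain are exactly the "most representative words of a certain category" that the dictionary-based approaches above select.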
In recent years, topic modeling approaches, such as Latent Dirichlet Allocation
(LDA) [10], have been widely applied to text classification. One typical example is that Blei
et al. [11] derived a set of hidden topics obtained by LDA as new features from a large
existing Web corpus. Phan et al. [12] put forward an improved method that exploits
topics with multi-granularity based on LDA, modeling short texts more precisely to a
certain extent. Vo and Ock [5] proposed methods for enhancing features using topic
models, which make short texts seem less sparse and more topic-oriented for
classification. These approaches provided new ways of enhancing features by combining
external texts from topic models, making documents more amenable to classification.
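To make the topic-feature idea concrete, the following sketch folds a short text into a fixed set of LDA-style topics. The topic-word probabilities `phi` are hypothetical stand-ins for distributions learned on a large background corpus, and the iterative update is a simplified EM-style estimate of the document-topic distribution, not the exact inference procedure used in the works cited above:

```python
def infer_theta(tokens, phi, alpha=0.1, iters=50):
    """Fold a short text into fixed topics: repeatedly re-estimate the
    document-topic distribution theta while holding phi fixed."""
    k = len(phi)
    theta = [1.0 / k] * k
    for _ in range(iters):
        counts = [alpha] * k  # symmetric Dirichlet smoothing
        for w in tokens:
            # responsibility of each topic for word w under current theta
            p = [theta[z] * phi[z].get(w, 1e-6) for z in range(k)]
            s = sum(p)
            for z in range(k):
                counts[z] += p[z] / s
        total = sum(counts)
        theta = [c / total for c in counts]
    return theta

# Hypothetical topic-word probabilities, as if learned by LDA
# on a background news corpus (hand-set here for two toy topics).
phi = [
    {"goal": 0.4, "match": 0.4, "market": 0.1, "price": 0.1},  # sports-like topic
    {"goal": 0.1, "match": 0.1, "market": 0.4, "price": 0.4},  # finance-like topic
]
theta = infer_theta(["goal", "match", "match"], phi)
print(theta)  # probability mass concentrates on the sports-like topic
```

The resulting low-dimensional vector `theta` is the kind of dense semantic feature that makes a sparse short text "seem less sparse and more topic-oriented".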
This paper presents a short text classification method based on lexical category
features and optimal topic distribution. Firstly, a mechanism is introduced to build a
term dictionary by selecting lexical features which are the most representative words of
a category. Then the optimal topic distribution of the background knowledge
repository is obtained via LDA, and the original features are extended by combining
the selected features with the optimal topic distribution, on which the SVM
classifier and k-NN are trained. Finally, an experimental evaluation on real text data is
conducted.
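The pipeline described above (dictionary-based lexical indicators concatenated with a topic distribution, then fed to a standard classifier) can be sketched as follows; the term dictionary, the topic vectors, and the 1-NN stand-in for the SVM/k-NN classifiers are all illustrative assumptions:

```python
import math

def build_feature_vector(tokens, term_dict, theta):
    """Concatenate lexical and semantic features: binary indicators over
    the selected term dictionary, followed by the topic distribution."""
    lexical = [1.0 if t in tokens else 0.0 for t in term_dict]
    return lexical + list(theta)

def nn_classify(vec, train_vecs, train_labels):
    """Minimal 1-NN by Euclidean distance, a stand-in for the
    SVM / k-NN classifiers trained in the paper."""
    dists = [math.dist(vec, tv) for tv in train_vecs]
    return train_labels[dists.index(min(dists))]

# Illustrative term dictionary and hand-set topic distributions.
term_dict = ["match", "score", "market", "price"]
train = [
    (build_feature_vector(["match", "score"], term_dict, [0.9, 0.1]), "sports"),
    (build_feature_vector(["market", "price"], term_dict, [0.1, 0.9]), "finance"),
]
vec = build_feature_vector(["match", "goal"], term_dict, [0.8, 0.2])
print(nn_classify(vec, [v for v, _ in train], [l for _, l in train]))  # sports
```

Note that even when a test title shares only one dictionary word with the training data, the appended topic distribution still pulls it toward the right class.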
The remainder of this paper is organized as follows. We describe the relevant
theoretical knowledge in Sect. 2. Section 3 introduces the proposed general framework
combining lexical category features and semantics for short text classification.
Experimental designs and findings are presented in Sect. 4. Section 5 concludes the
proposed work and points out our future work.
2 Problem Preliminaries
In this paper, we mainly focus on news titles, which appear frequently on social networks, and the background knowledge is extracted from news content. Let D = {d_1, d_2, …, d_m} be the short text corpus, where m is the number of texts in D. W = {t_1, t_2, …, t_n} denotes the vocabulary of D, where n is the number of unique words in D. C = {c_1, c_2, …, c_k} is the collection of class labels, where k is the number of class labels. The term dictionary W(t) = {t_1, t_2, …, t_l} is constructed by selecting lexical features that are the most distinctive words with regard to a certain category, where l denotes the number of selected distinctive words from all categories and l ≪ n. Next, we will introduce relevant theoretical knowledge.
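Under this notation, the construction of W(t) might be sketched as follows; the frequency-ratio distinctiveness score and the toy corpus are illustrative assumptions, since the paper's actual selection measure is defined in the later sections:

```python
from collections import Counter, defaultdict

def build_term_dictionary(docs, labels, l_per_class=1):
    """Select, for each class c in C, the words most over-represented in
    that class relative to the whole corpus; their union over all classes
    forms the term dictionary W(t). The frequency-ratio score used here
    is an illustrative proxy, not the paper's exact measure."""
    overall = Counter(w for d in docs for w in d)
    by_class = defaultdict(Counter)
    for d, lab in zip(docs, labels):
        by_class[lab].update(d)
    selected = []
    for lab in sorted(by_class):
        counts = by_class[lab]
        # rank by class-frequency / corpus-frequency, break ties by count
        ranked = sorted(counts,
                        key=lambda w: (counts[w] / overall[w], counts[w]),
                        reverse=True)
        selected.extend(ranked[:l_per_class])
    return selected

# Toy corpus of news titles: "news" occurs everywhere, so it is never distinctive.
docs = [["goal", "match", "news"], ["match", "score", "news"],
        ["stock", "market", "news"], ["market", "price", "news"]]
labels = ["sports", "sports", "finance", "finance"]
print(build_term_dictionary(docs, labels, l_per_class=1))
```

With l words kept per class, the dictionary has l ≪ n entries, which is what keeps the lexical part of the final feature vector low-dimensional.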