Effectively Classifying Short Texts via Improved Lexical Category and Semantic Features
Huifang Ma(✉), Runan Zhou, Fang Liu, and Xiaoyong Lu
College of Computer Science and Engineering,
Northwest Normal University, Lanzhou, China
mahuifang@yeah.net
Abstract. Classification of short text is challenging due to its severe sparseness
and high dimension, which are typical characteristics of short text. In this paper,
we propose a novel approach to classify short texts based on both lexical and
semantic features. Firstly, the term dictionary is constructed by selecting lexical
features that are most representative words of a certain category, and then the
optimal topic distribution from the background knowledge repository is
extracted via Latent Dirichlet Allocation. The new feature for short text is
thereafter constructed. The experimental results show that our method achieved
significant quality enhancement in terms of short text classification.
Keywords: Short text classification · Latent Dirichlet allocation · Lexical features · Semantic features · Optimal topic distribution
1 Introduction
With the rapid development of the social network, we are now dealing with much more
short texts in various applications. Examples are snippets in search results, tweets,
status updates, news comments, and reviews from various social platforms. There is an
urgent demand to interpret these short texts efficiently and effectively. However, since short
texts do not provide sufficient word occurrences, traditional methods [1] such as
“Bag-Of-Words” fail to represent these short texts accurately due to their
high dimension and severe sparsity. One crucial question, for conventional classification
methods like support vector machines (SVM) [2] and k-nearest neighbors (k-NN) [3], is
how to select representative features in such short documents. The characteristics of
short text will hinder the application of conventional machine learning and text mining
algorithms.
Most existing approaches have attempted to enrich short texts to obtain more features,
combining additional semantics, contexts, and associations. Existing work in the
literature tried to address the aforementioned challenges from two directions. One was to
employ search engines and utilize the search results to expand related contextual
content [4, 5], while the other was to utilize external repositories as background
knowledge [6, 7]. These two types of methods can partly solve the problem, but
still leave much room for improvement. Both methods tried to obtain more semantic
© Springer International Publishing Switzerland 2016
D.-S. Huang et al. (Eds.): ICIC 2016, Part I, LNCS 9771, pp. 163–174, 2016.
DOI: 10.1007/978-3-319-42291-6_16
information by retrieving terms from search engines, Wikipedia, or other resources;
however, this not only introduces irrelevant information but also consumes much
more time.
In the opposite direction of enriching short text representation, some researchers [8]
proposed to trim a short text representation down to a few of the most representative words for
short text classification. Others attempted to represent short texts via effective feature
selection methods that reduce the dimension of the feature space. Short texts can then be
represented by these selected features. Several feature selection measures have been put
forward to reduce dimensionality in the past years, such as term frequency-inverse
document frequency (TF-IDF), information gain (IG), mutual information (MI) and
expected cross entropy (ECE), etc. [9].
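As a rough illustration of one of these measures, the sketch below computes information gain for candidate terms over a tiny hand-made labeled corpus; the corpus, the tokenization, and the use of base-2 entropy are assumptions of the example, not details taken from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a label distribution."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(term, docs, labels):
    """IG(t) = H(C) - H(C | t), where the condition is the term's
    presence or absence in each document of a labeled corpus."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = sum(len(part) / n * entropy(part)
               for part in (with_t, without_t) if part)
    return entropy(labels) - cond

# Toy corpus: two "sports" titles and two "finance" titles.
docs = [["goal", "match"], ["match", "score"],
        ["stock", "market"], ["market", "price"]]
labels = ["sports", "sports", "finance", "finance"]
print(information_gain("match", docs, labels))  # 1.0: perfectly separates the classes
print(information_gain("score", docs, labels))  # lower: appears in only one document
```

Terms with high information gain are exactly the "most representative words of a certain category" that the dictionary-based approaches above select.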
In recent years, topic modeling approaches, such as Latent Dirichlet Allocation
(LDA) [10], have been widely applied to text classification. One typical example is that Blei
et al. [11] derived a set of hidden topics obtained by LDA as new features from a large
existing Web corpus. Phan et al. [12] put forward an improved method that exploits
topics with multi-granularity based on LDA, modeling short texts more precisely to a
certain extent. Vo and Ock [5] proposed methods for enhancing features using topic
models, which make short texts seem less sparse and more topic-oriented for
classification. These approaches provided new ways of enhancing features by combining
external texts from topic models, making documents more amenable to classification.
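To make the topic-feature idea concrete, the following sketch folds a short text into a fixed set of LDA-style topics. The topic-word probabilities `phi` are hypothetical stand-ins for distributions learned on a large background corpus, and the iterative update is a simplified EM-style estimate of the document-topic distribution, not the exact inference procedure used in the works cited above:

```python
def infer_theta(tokens, phi, alpha=0.1, iters=50):
    """Fold a short text into fixed topics: repeatedly re-estimate the
    document-topic distribution theta while holding phi fixed."""
    k = len(phi)
    theta = [1.0 / k] * k
    for _ in range(iters):
        counts = [alpha] * k  # symmetric Dirichlet smoothing
        for w in tokens:
            # responsibility of each topic for word w under current theta
            p = [theta[z] * phi[z].get(w, 1e-6) for z in range(k)]
            s = sum(p)
            for z in range(k):
                counts[z] += p[z] / s
        total = sum(counts)
        theta = [c / total for c in counts]
    return theta

# Hypothetical topic-word probabilities, as if learned by LDA
# on a background news corpus (hand-set here for two toy topics).
phi = [
    {"goal": 0.4, "match": 0.4, "market": 0.1, "price": 0.1},  # sports-like topic
    {"goal": 0.1, "match": 0.1, "market": 0.4, "price": 0.4},  # finance-like topic
]
theta = infer_theta(["goal", "match", "match"], phi)
print(theta)  # probability mass concentrates on the sports-like topic
```

The resulting low-dimensional vector `theta` is the kind of dense semantic feature that makes a sparse short text "seem less sparse and more topic-oriented".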
This paper presents a short text classification method based on lexical category
features and optimal topic distribution. Firstly, a mechanism is introduced to build a
term dictionary by selecting lexical features which are the most representative words of
a category. Then the optimal topic distribution of the background knowledge
repository is obtained via LDA, and the original features are extended by combining
the selected features with the optimal topic distribution, on which the SVM
classifier and k-NN are trained. Finally, an experimental evaluation on real text data is
conducted.
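The pipeline described above (dictionary-based lexical indicators concatenated with a topic distribution, then fed to a standard classifier) can be sketched as follows; the term dictionary, the topic vectors, and the 1-NN stand-in for the SVM/k-NN classifiers are all illustrative assumptions:

```python
import math

def build_feature_vector(tokens, term_dict, theta):
    """Concatenate lexical and semantic features: binary indicators over
    the selected term dictionary, followed by the topic distribution."""
    lexical = [1.0 if t in tokens else 0.0 for t in term_dict]
    return lexical + list(theta)

def nn_classify(vec, train_vecs, train_labels):
    """Minimal 1-NN by Euclidean distance, a stand-in for the
    SVM / k-NN classifiers trained in the paper."""
    dists = [math.dist(vec, tv) for tv in train_vecs]
    return train_labels[dists.index(min(dists))]

# Illustrative term dictionary and hand-set topic distributions.
term_dict = ["match", "score", "market", "price"]
train = [
    (build_feature_vector(["match", "score"], term_dict, [0.9, 0.1]), "sports"),
    (build_feature_vector(["market", "price"], term_dict, [0.1, 0.9]), "finance"),
]
vec = build_feature_vector(["match", "goal"], term_dict, [0.8, 0.2])
print(nn_classify(vec, [v for v, _ in train], [l for _, l in train]))  # sports
```

Note that even when a test title shares only one dictionary word with the training data, the appended topic distribution still pulls it toward the right class.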
The remainder of this paper is organized as follows. We describe the relevant
theoretical knowledge in Sect. 2. Section 3 introduces the proposed general framework
combining lexical category features and semantics for short text classification.
Experimental designs and findings are presented in Sect. 4. Section 5 concludes the
proposed work and points out our future work.
2 Problem Preliminaries
In this paper, we mainly focus on news titles, which appear frequently on social networks, and the background knowledge is extracted from news content. Let D = {d_1, d_2, …, d_m} be the short text corpus, where m is the number of texts in D. W = {t_1, t_2, …, t_n} denotes the vocabulary of D, where n is the number of unique words in D. C = {c_1, c_2, …, c_k} is the collection of class labels, where k is the number of class labels. The term dictionary W(t) = {t_1, t_2, …, t_l} is constructed by selecting lexical features that are the most distinctive words with regard to a certain category, where l denotes the number of selected distinctive words from all categories and l ≪ n. Next, we will introduce relevant theoretical knowledge.
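Under this notation, the construction of W(t) might be sketched as follows; the frequency-ratio distinctiveness score and the toy corpus are illustrative assumptions, since the paper's actual selection measure is defined in the later sections:

```python
from collections import Counter, defaultdict

def build_term_dictionary(docs, labels, l_per_class=1):
    """Select, for each class c in C, the words most over-represented in
    that class relative to the whole corpus; their union over all classes
    forms the term dictionary W(t). The frequency-ratio score used here
    is an illustrative proxy, not the paper's exact measure."""
    overall = Counter(w for d in docs for w in d)
    by_class = defaultdict(Counter)
    for d, lab in zip(docs, labels):
        by_class[lab].update(d)
    selected = []
    for lab in sorted(by_class):
        counts = by_class[lab]
        # rank by class-frequency / corpus-frequency, break ties by count
        ranked = sorted(counts,
                        key=lambda w: (counts[w] / overall[w], counts[w]),
                        reverse=True)
        selected.extend(ranked[:l_per_class])
    return selected

# Toy corpus of news titles: "news" occurs everywhere, so it is never distinctive.
docs = [["goal", "match", "news"], ["match", "score", "news"],
        ["stock", "market", "news"], ["market", "price", "news"]]
labels = ["sports", "sports", "finance", "finance"]
print(build_term_dictionary(docs, labels, l_per_class=1))
```

With l words kept per class, the dictionary has l ≪ n entries, which is what keeps the lexical part of the final feature vector low-dimensional.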