Language Processing). But few studies have used this
model to solve the sparseness problem of short texts.
• We train the N-Gram model on the data set we already have and build the feature extension library from it. Unlike extension methods that rely on external data sets, our method scales to large data sets, since repeatedly querying search engines is quite time-consuming; we can quickly extract n-grams and build the feature extension library from the data set itself.
• The results of our experiments show that our feature extension method does increase the feature density of the original short texts, and the classifier trained on the extended texts gains about 10% in accuracy.
The rest of this paper is organized as follows. In Section II, we introduce related work on short text classification. In Sections III and IV, we describe our scheme in detail. We then present experiments and results in Section V. Finally, in Section VI, we present our conclusions and future work.
II. RELATED WORK
Up to now, researchers have proposed several kinds of methods to solve the sparseness problem in short text classification, and the results show that many of them can represent short texts better and categorize them more accurately.
Some researchers tried to reduce the spatial dimension based on semantic analysis [3]. Latent topic models such as LSA (latent semantic analysis), pLSA (probabilistic latent semantic analysis) and LDA (latent Dirichlet allocation) are often used. Latent topic models can extract topic words from short texts, so that the texts can be transferred from the usually high-dimensional “text-feature” space into the much lower-dimensional “text-topic” space. In the “text-topic” space, synonymous words map to similar or identical topics, so the inner information and semantic structure of the text set can be mined, improving the efficiency of short text analysis. Bing-kun Wang et al. [4] presented a new method to tackle the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. The language-independent semantic (LIS) kernel [5] was proposed to overcome the language dependency that arises when exploiting syntactic or semantic information; it can effectively compute the similarity between short texts without using grammatical tags or lexical databases. Mengen Chen et al. [6] proposed extracting topics at multiple granularities, which models short texts more precisely.
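As a minimal illustration of this topic-model route (a sketch of the general idea only, not the implementation of any work cited above), the following Python snippet maps a toy set of short texts from the sparse “text-feature” space into a low-dimensional “text-topic” space with scikit-learn's LDA; the corpus and the number of topics are assumptions chosen purely for the example.

# Sketch: project short texts from the "text-feature" space into a
# low-dimensional "text-topic" space with LDA (toy corpus, assumed parameters).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

short_texts = [
    "cheap flight ticket to beijing",
    "book hotel room near the airport",
    "new phone camera review",
    "best smartphone battery life",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(short_texts)          # sparse "text-feature" space

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                   # dense "text-topic" space

print(doc_topic.shape)                             # (4, 2): far fewer dimensions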
Some researchers found novel methods of selecting features for texts to solve the sparseness problem. Instead of the traditional BOW (bag-of-words) model, Sriram et al. [12] analysed the characteristics of Twitter texts and proposed the 8-F model (eight features such as the presence of shortened words and slang, time-event phrases, opinionated words and so on) to represent short texts. Sun [13] tried to mimic human voting behaviour to classify short texts. Meng Wang et al. [7] improved the traditional TF-IDF algorithm for feature selection and weight computation, proposing a new measure called DFICF, where ICF is the inverse category frequency of a word, and selecting features based on mutual information. Yuan et al. [8] tried to optimize the classifier itself to get better performance on sparse data sets: they used a Naïve Bayes classifier with four smoothing methods and carried out their experiments on Yahoo! Q&A data sets, finding that proper smoothing methods can improve the accuracy of the Naïve Bayes classifier on short texts to a large degree.
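To make the smoothing idea concrete, here is a hedged sketch (not the exact setup, smoothing methods or data of [8]) showing how additive (Laplace/Lidstone) smoothing is controlled through the alpha parameter of a multinomial Naïve Bayes classifier in scikit-learn; the tiny training set and alpha values are illustrative assumptions.

# Sketch: Naive Bayes with additive smoothing on a toy short-text set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["cheap flight to beijing", "hotel booking deal",
               "phone camera review", "smartphone battery test"]
train_labels = ["travel", "travel", "tech", "tech"]

for alpha in (1e-10, 0.1, 1.0):    # almost no smoothing vs. Laplace smoothing
    clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=alpha))
    clf.fit(train_texts, train_labels)
    print(alpha, clf.predict(["camera of the new phone"]))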
Many methods proposed to solve the sparseness problem are based on feature extension. Some researchers used external resources such as search engines and open knowledge sources to expand short texts. Public search engines have been used to help compute the similarity between short texts or even between two single terms [9,10,11]: the short texts and terms are submitted to the search engine as query words, and the returned results are used to expand them. Banerjee [14] combined information retrieval techniques with Wikipedia data: they built a search engine over the Wikipedia data, used short texts as query keywords, and then extended the features based on the results. In the work of X. Hu [15], different external data resources are used: if a text had more than one feature, Wikipedia data was used to extend it, and if it had only one feature, WordNet was used instead. In the opposite direction to enriching short texts with external resources, some researchers focus on deep mining of the short texts themselves. Xinhua Fan [16] made use of term co-occurrence to build a feature extension model, and adjusted measures such as confidence and relevancy strength to improve the quality of the feature extension library.
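The co-occurrence idea can be sketched roughly as follows; the association-rule-style confidence measure, the threshold of 0.5 and the toy corpus are assumptions made only for illustration and are not the exact measures used in [16].

# Sketch: build a co-occurrence-based extension table and keep pairs whose
# confidence P(b | a) passes a threshold (illustrative value 0.5).
from collections import Counter
from itertools import combinations

tokenized_texts = [["cheap", "flight", "ticket"],
                   ["flight", "ticket", "beijing"],
                   ["phone", "camera", "review"]]

term_count = Counter()
pair_count = Counter()
for words in tokenized_texts:
    uniq = sorted(set(words))
    term_count.update(uniq)
    pair_count.update(combinations(uniq, 2))

extension = {}
for (a, b), n_ab in pair_count.items():
    if n_ab / term_count[a] >= 0.5:     # confidence(a -> b)
        extension.setdefault(a, []).append(b)
    if n_ab / term_count[b] >= 0.5:     # confidence(b -> a)
        extension.setdefault(b, []).append(a)

print(extension.get("flight"))          # candidate extension terms for "flight"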
It is easy to see that feature extension based on search engines or other external data resources is time-consuming, especially when data sets are very large. Moreover, semantic similarity is sometimes not the only criterion that determines the returned results, as with the paid ranking of Baidu. Extension based on term co-occurrence, in turn, easily introduces noisy features: co-occurrence only considers how often two terms appear together, no matter how far apart the two words are, yet two words in different sentences may not have a strong semantic relation. Therefore, in this paper we use the N-Gram model to build the feature extension library. Considering the length of short texts, the bigram model is actually used in our experiments.
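As a minimal sketch of this idea (the actual construction of the library is described in Section IV; the tokenized corpus and the top-k cut-off below are illustrative assumptions), bigram statistics collected from the data set itself can map each word to the words that most often follow it:

# Sketch: collect bigram counts from the data set and keep, for each word,
# its most frequent successors as extension candidates (assumed k = 2).
from collections import Counter, defaultdict

tokenized_texts = [["cheap", "flight", "ticket"],
                   ["flight", "ticket", "price"],
                   ["phone", "camera", "review"]]

bigram_counts = defaultdict(Counter)
for words in tokenized_texts:
    for w1, w2 in zip(words, words[1:]):     # adjacent words only
        bigram_counts[w1][w2] += 1

k = 2
extension_library = {w: [nxt for nxt, _ in cnt.most_common(k)]
                     for w, cnt in bigram_counts.items()}

print(extension_library["flight"])           # most likely words after "flight"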
III. OVERALL FRAMEWORK
The overall framework of our work is shown in Fig. 1.
Figure 1. The overall framework of our work
First of all, we have to preprocess all the texts. That means splitting texts into word sequences, feature selection, TF-