Language Processing). But few studies have used this
model to solve the sparseness problem of short texts.
• We train the N-Gram model on the data set we already have and build the feature extension library from it. Unlike extension methods that rely on external data sets, our method scales to large data sets, since repeatedly querying search engines is quite time-consuming; we can quickly extract n-grams and build the feature extension library from the data set itself.
• The results of our experiments show that our feature extension method does increase the feature density of the original short texts, and the classifier trained on the extended texts gains about 10% in accuracy.
The rest of this paper is organized as follows. In Section II, we introduce related work on short text classification. In Sections III and IV, we describe our scheme in detail. We then present experiments and results in Section V. Finally, in Section VI, we present our conclusions and future work.
II. RELATED WORK
Up to now, researchers have proposed several kinds of methods to solve the sparseness problem in short text classification, and the results show that many of them can represent short texts better and categorize them more accurately.
Some researchers tried to reduce the spatial dimension based on semantic analysis [3]. Latent topic models such as LSA (latent semantic analysis), pLSA (probabilistic latent semantic analysis) and LDA (latent Dirichlet allocation) are often used. Latent topic models can extract topic words from short texts, so that the texts can be transferred from the usually high-dimensional “text-feature” space into the much lower-dimensional “text-topic” space. In the “text-topic” space, synonymous words map to similar or identical topics, so the inner information and semantic structure of the text set can be mined, improving the efficiency of short text analysis. Bing-kun Wang et al. [4] presented a new method to tackle the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. The language-independent semantic (LIS) kernel [5] was proposed to overcome the language dependency that arises when exploiting syntactic or semantic information; it can effectively compute the similarity between short texts without using grammatical tags or lexical databases. Mengen Chen et al. [6] proposed extracting topics at multiple granularities, which models short texts more precisely.
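As a minimal illustration of this topic-model route (a sketch of the general idea only, not the implementation of any work cited above), the following Python snippet maps a toy set of short texts from the sparse “text-feature” space into a low-dimensional “text-topic” space with scikit-learn's LDA; the corpus and the number of topics are assumptions chosen purely for the example.

# Sketch: project short texts from the "text-feature" space into a
# low-dimensional "text-topic" space with LDA (toy corpus, assumed parameters).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

short_texts = [
    "cheap flight ticket to beijing",
    "book hotel room near the airport",
    "new phone camera review",
    "best smartphone battery life",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(short_texts)          # sparse "text-feature" space

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                   # dense "text-topic" space

print(doc_topic.shape)                             # (4, 2): far fewer dimensions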
Some researchers found novel methods of selecting features for texts to solve the sparseness problem. Instead of the traditional BOW (bag-of-words) model, Sriram et al. [12] analysed the characteristics of Twitter texts and proposed the 8-F model (eight features such as the presence of shortened words and slang, time-event phrases, opinionated words and so on) to represent short texts. Sun [13] tried to mimic human voting behaviour to classify short texts. Meng Wang et al. [7] improved the traditional TF-IDF algorithm for feature selection and weight computation, proposing a new measure called DFICF, where ICF is the inverse category frequency of a word, and selecting features based on mutual information. Yuan et al. [8] tried to optimize the classifier itself to get better performance on sparse data sets: they used a Naïve Bayes classifier with four smoothing methods and carried out their experiments on Yahoo! Q&A data sets, finding that proper smoothing methods can improve the accuracy of the Naïve Bayes classifier on short texts to a large degree.
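To make the smoothing idea concrete, here is a hedged sketch (not the exact setup, smoothing methods or data of [8]) showing how additive (Laplace/Lidstone) smoothing is controlled through the alpha parameter of a multinomial Naïve Bayes classifier in scikit-learn; the tiny training set and alpha values are illustrative assumptions.

# Sketch: Naive Bayes with additive smoothing on a toy short-text set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["cheap flight to beijing", "hotel booking deal",
               "phone camera review", "smartphone battery test"]
train_labels = ["travel", "travel", "tech", "tech"]

for alpha in (1e-10, 0.1, 1.0):    # almost no smoothing vs. Laplace smoothing
    clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=alpha))
    clf.fit(train_texts, train_labels)
    print(alpha, clf.predict(["camera of the new phone"]))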
Many methods proposed to solve the sparseness problem are based on feature extension. Some researchers used external resources such as search engines and open knowledge sources to expand short texts. Public search engines have been used to help compute the similarity between short texts or even between two single terms [9,10,11]: the short texts and terms are submitted to the search engine as query words, and the returned results are used to expand them. Banerjee [14] combined information retrieval techniques with Wikipedia data: they built a search engine over the Wikipedia data, used short texts as query keywords, and then extended the features based on the results. In the work of X. Hu [15], different external data resources are used: if a text had more than one feature, Wikipedia data was used to extend it, and if it had only one feature, WordNet was used instead. In the opposite direction to enriching short texts with external resources, some researchers focus on deep mining of the short texts themselves. Xinhua Fan [16] made use of term co-occurrence to build a feature extension model, and adjusted measures such as confidence and relevancy strength to improve the quality of the feature extension library.
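The co-occurrence idea can be sketched roughly as follows; the association-rule-style confidence measure, the threshold of 0.5 and the toy corpus are assumptions made only for illustration and are not the exact measures used in [16].

# Sketch: build a co-occurrence-based extension table and keep pairs whose
# confidence P(b | a) passes a threshold (illustrative value 0.5).
from collections import Counter
from itertools import combinations

tokenized_texts = [["cheap", "flight", "ticket"],
                   ["flight", "ticket", "beijing"],
                   ["phone", "camera", "review"]]

term_count = Counter()
pair_count = Counter()
for words in tokenized_texts:
    uniq = sorted(set(words))
    term_count.update(uniq)
    pair_count.update(combinations(uniq, 2))

extension = {}
for (a, b), n_ab in pair_count.items():
    if n_ab / term_count[a] >= 0.5:     # confidence(a -> b)
        extension.setdefault(a, []).append(b)
    if n_ab / term_count[b] >= 0.5:     # confidence(b -> a)
        extension.setdefault(b, []).append(a)

print(extension.get("flight"))          # candidate extension terms for "flight"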
It is easy to see that feature extension based on search engines or other external data resources is time-consuming, especially when data sets are very large. Moreover, semantic similarity is sometimes not the only criterion that determines the returned results, as with the paid ranking of Baidu. Extension based on term co-occurrence, in turn, easily introduces noisy features: co-occurrence only considers how often two terms appear together, no matter how far apart the two words are, yet two words in different sentences may not have a strong semantic relation. Therefore, in this paper we use the N-Gram model to build the feature extension library. Considering the length of short texts, the bigram model is actually used in our experiments.
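As a minimal sketch of this idea (the actual construction of the library is described in Section IV; the tokenized corpus and the top-k cut-off below are illustrative assumptions), bigram statistics collected from the data set itself can map each word to the words that most often follow it:

# Sketch: collect bigram counts from the data set and keep, for each word,
# its most frequent successors as extension candidates (assumed k = 2).
from collections import Counter, defaultdict

tokenized_texts = [["cheap", "flight", "ticket"],
                   ["flight", "ticket", "price"],
                   ["phone", "camera", "review"]]

bigram_counts = defaultdict(Counter)
for words in tokenized_texts:
    for w1, w2 in zip(words, words[1:]):     # adjacent words only
        bigram_counts[w1][w2] += 1

k = 2
extension_library = {w: [nxt for nxt, _ in cnt.most_common(k)]
                     for w, cnt in bigram_counts.items()}

print(extension_library["flight"])           # most likely words after "flight"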
III. OVERALL FRAMEWORK
The overall framework of our work is shown in Fig. 1.
Figure 1. The overall framework of our work
First of all, we have to preprocess all the texts. That means splitting texts into word sequences, feature selection, TF-