Paper resources for a product review ranking system based on learning-to-rank techniques (CSDN Library)

Keywords: learning to rank, product review filtering, review ranking

2.05 MB RAR archive, uploaded 2010-04-21 19:51:10

基于排序学习技术的商品评论过滤系统.rar (Product Review Filtering System Based on Learning-to-Rank Techniques), 6 PDF files:

WSDM08 - Opinion Spam and Analysis.pdf 386KB

EMNLP07 - Low-Quality Product Review Detection in Opinion Summarization.pdf 578KB

ICEC07-Designing Novel Review Ranking Systems.pdf 251KB

SocialCom09 - Ranking Comments on the Social Web.pdf 396KB

AAAI09 - REVRANK - a Fully Unsupervised Algorithm for selecting the most helpful book reviews.pdf 303KB

WSDM08 - Finding High-Quality Content in Social Media.pdf 615KB

Finding High-Quality Content in Social Media

Eugene Agichtein

Emory University

Atlanta, USA

eugene@mathcs.emory.edu

Carlos Castillo

Yahoo! Research

Barcelona, Spain

chato@yahoo-inc.com

Debora Donato

Yahoo! Research

Barcelona, Spain

debora@yahoo-inc.com

Aristides Gionis

Yahoo! Research

Barcelona, Spain

gionis@yahoo-inc.com

Gilad Mishne

Search and Advertising Sciences, Yahoo!

gilad@yahoo-inc.com

ABSTRACT

The quality of user-generated content varies drastically from

excellent to abuse and spam. As the availability of such con-

tent increases, the task of identifying high-quality content

in sites based on user contributions—social media sites—

becomes increasingly important. Social media in general

exhibit a rich variety of information sources: in addition to

the content itself, there is a wide array of non-content infor-

mation available, such as links between items and explicit

quality ratings from members of the community. In this pa-

per we investigate methods for exploiting such community

feedback to automatically identify high quality content. As

a test case, we focus on Yahoo! Answers, a large community

question/answering portal that is particularly rich in the

amount and types of content and social interactions avail-

able in it. We introduce a general classiﬁcation framework

for combining the evidence from diﬀerent sources of infor-

mation, that can be tuned automatically for a given social

media type and quality deﬁnition. In particular, for the

community question/answering domain, we show that our

system is able to separate high-quality items from the rest

with an accuracy close to that of humans.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Con-

tent Analysis and Indexing – indexing methods, linguistic

processing; H.3.3 Information Search and Retrieval – infor-

mation ﬁltering, search process.

General Terms

Algorithms, Design, Experimentation.

Keywords

Social media, Community Question Answering, User Inter-

actions.

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for prot or commercial advantage and that copies

bear this notice and the full citation on the rst page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior specic

permission and/or a fee.

WSDM'08, February 11-12, 2008, Palo Alto, California, USA.

1. INTRODUCTION

Recent years have seen a transformation in the type of

content available on the web. During the ﬁrst decade of the

web’s prominence—from the early 1990s onwards—most on-

line content resembled traditional published material: the

majority of web users were consumers of content, created

by a relatively small number of publishers. From the early

2000s, user-generated content has become increasingly pop-

ular on the web: more and more users participate in con-

tent creation, rather than just consumption. Popular user-

generated content (or social media) domains include blogs

and web forums, social bookmarking sites, photo and video

sharing communities, as well as social networking platforms

such as Facebook and MySpace, which offer a combination

of all of these with an emphasis on the relationships among

the users of the community.

Community-driven question/answering portals are a particular

form of user-generated content that has gained a large

audience in recent years. These portals, in which users an-

swer questions posed by other users, provide an alternative

channel for obtaining information on the web: rather than

browsing results of search engines, users present detailed

information needs and get direct responses authored by humans.

In some markets, this information-seeking behavior

dominates traditional web search [29].

An important diﬀerence between user-generated content

and traditional content that is particularly signiﬁcant for

knowledge-based media such as question/answering portals

is the variance in the quality of the content. As Ander-

son [3] describes, in traditional publishing—mediated by a

publisher—the typical range of quality is substantially nar-

rower than in niche, unmediated markets. The main chal-

lenge posed by content in social media sites is the fact that

the distribution of quality has high variance: from very

high-quality items to low-quality, sometimes abusive con-

tent. This makes the tasks of ﬁltering and ranking in such

systems more complex than in other domains. However, for

information-retrieval tasks, social media systems present in-

herent advantages over traditional collections of documents:

their rich structure oﬀers more available data than in other

domains. In addition to document content and link struc-

ture, social media exhibit a wide variety of user-to-document

relation types, and user-to-user interactions.

In this paper we address the task of identifying high-

quality content in community-driven question/answering sites,

exploring the beneﬁts of having additional sources of infor-

mation in this domain. As a test case, we focus on Ya-

hoo! Answers, a large portal that is particularly rich in the

amount and types of content and social interaction available

in it. We focus on the following research questions:

1. What are the elements of social media that can be used

to facilitate automated discovery of high-quality con-

tent? In addition to the content itself, there is a wide

array of non-content information available, from links

between items to explicit and implicit quality ratings

from members of the community. What is the utility

of each source of information to the task of estimating

quality?

2. How are these diﬀerent factors related? Is content

alone enough for identifying high-quality items?

3. Can community feedback approximate judgments of spe-

cialists?

To our knowledge, this is the ﬁrst large-scale study of com-

bining the analysis of the content with the user feedback

in social media. In particular, we model all user interac-

tions in a principled graph-based framework (Section 3 and

Section 4), allowing us to eﬀectively combine the diﬀerent

sources of evidence in a classiﬁcation formulation. Further-

more, we investigate the utility of the diﬀerent sources of

feedback in a large-scale, experimental setting (Section 5)

over the market-leading question/answering portal. Our experimental

results show that these sources of evidence are

complementary, and allow our system to exhibit high accu-

racy in the task of identifying content of high quality (Sec-

tion 6). We discuss our ﬁndings and directions for future

work in Section 7, which concludes this paper.

2. BACKGROUND AND RELATED WORK

Social media content has become indispensable to millions

of users. In particular, community question/answering portals

are a popular destination for users looking for help with

a particular situation, for entertainment, and for community

interaction. Hence, in this paper we focus on one particu-

larly important manifestation of social media – community

question/answering sites, speciﬁcally on Yahoo! Answers.

Our work draws on a significant amount of prior research on

social media, and we outline the related work before introducing

our framework in Section 3.

2.1 Yahoo! Answers

Yahoo! Answers is a question/answering system where

people ask and answer questions on any topic. What makes

this system interesting is that around a seemingly trivial

question/answer paradigm, users are forming a social net-

work characterized by heterogeneous interactions. As a mat-

ter of fact, users do not only limit their activity to asking

and answering questions, but they also actively participate

in regulating the whole system. A user can vote for answers

of other users, mark interesting questions, and even report

abusive behavior. Thus, overall, each user has a threefold

role: asker, answerer and evaluator.

Footnote: Yahoo! Answers, http://answers.yahoo.com/

The central elements of the Yahoo! Answers system are

questions. Each question has a lifecycle. It starts in an

“open” state where it receives answers. Then at some point

(decided by the asker, or by an automatic timeout in the

system), the question is considered “closed,” and can receive

no further answers. At this stage, a “best answer” is se-

lected either by the asker or through a voting procedure

from other users; once a best answer is chosen, the question

is “resolved.”

As previously noted, the system is partially moderated by

the community: any user may report another user’s question

or answer as violating the community guidelines (e.g., con-

taining spam, adult-oriented content, copyrighted material,

etc.). A user can also award a question a “star”, marking it

as an interesting question, can sometimes vote for the best

answer for a question, and can give any answer a “thumbs

up” or “thumbs down” rating, corresponding to a positive or

negative vote respectively.

Yahoo! Answers is a very popular service (according to

some reports, it reached a market share of close to 100%

about a year after its launch [27]); as a result, it hosts a

very large number of questions and answers on a wide variety

of topics, making it a particularly useful domain for

examining content quality in social media. Similar existing

and past services (some with a different model) include

Amazon’s Askville, Google Answers, and Yedda.

2.2 Related work

Link analysis in social media.

Link-based methods have

been shown to be successful for several tasks in social me-

dia [30]. In particular, link-based ranking algorithms that

were successful in estimating the quality of web pages have

been applied in this context. Two of the most prominent

link-based ranking algorithms are PageRank [25] and HITS [22].

Consider a graph G = (V, E) with vertex set V corre-

sponding to the users of a question/answer system and hav-

ing a directed edge e = (u, v) ∈ E from a user u ∈ V to

a user v ∈ V if user u has answered at least one question

of user v. ExpertiseRank [32] corresponds to PageRank

over the transposed graph G' = (V, E'), that is, a score is

propagated from the person receiving the answer to the per-

son giving the answer. The recursion implies that if person u

was able to provide an answer to person v, and person v was

able to provide an answer to person w, then u should receive

some extra points given that he/she was able to provide an

answer to a person with a certain degree of expertise.
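The propagation described above can be sketched as PageRank run on the reversed answer graph. This is a minimal illustration, not the implementation from [32]: the function names, the damping factor of 0.85, the fixed iteration count, and the uniform handling of dangling nodes are all assumptions made for the example.

```python
def pagerank(edges, nodes, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over directed (src, dst) edges."""
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        nxt = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * score[node] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Dangling node (assumption): spread its score uniformly.
                share = damping * score[node] / n
                for target in nodes:
                    nxt[target] += share
        score = nxt
    return score


def expertise_rank(answer_edges, users, **kwargs):
    """answer_edges holds (u, v) pairs meaning 'u answered a question of v'.

    Reversing each edge makes score flow from the asker to the answerer.
    """
    transposed = [(v, u) for (u, v) in answer_edges]
    return pagerank(transposed, users, **kwargs)


users = ["u", "v", "w"]
answers = [("u", "v"), ("v", "w")]  # u answered v, and v answered w
scores = expertise_rank(answers, users)
```

In this toy graph u ends up with the highest score, because u answered v, who in turn was able to answer w.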

The HITS algorithm was applied over the same graph [8,

19] and it was shown to produce good results in ﬁnding

experts and/or good answers. The mutual reinforcement

process in this case can be interpreted as “good questions

attract good answers” and “good answers are given to good

questions”; we examine this assumption in Section 5.2.
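The mutual reinforcement idea can be illustrated with a small HITS sketch. The edge direction used here (from asker to answerer, so that answerers accumulate authority and askers accumulate hub score) is an assumption chosen for the example, as are the function name and the fixed iteration count; the cited studies may orient the graph differently.

```python
import math


def hits(edges, nodes, iterations=50):
    """Iterative HITS over directed (src, dst) edges with L2 normalization."""
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority of a node: sum of hub scores of nodes pointing to it.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub score of a node: sum of authority scores it points to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        for vec in (new_auth, new_hub):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for key in vec:
                vec[key] /= norm
        hub, auth = new_hub, new_auth
    return hub, auth


# Hypothetical data: askers a1..a3, answerers x and y.
nodes = ["a1", "a2", "a3", "x", "y"]
edges = [("a1", "x"), ("a2", "x"), ("a3", "x"), ("a1", "y")]
hub, auth = hits(edges, nodes)
```

Here x receives a higher authority score than y because more askers point to it, and a1 receives the highest hub score because it points to both answerers.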

Propagating reputation.

Guha et al. [14] study the problem

of propagating trust and distrust among Epinions users,

who may assign positive (trust) and negative (distrust) rat-

ings to each other. The authors study ways of combining

trust and distrust and observe that, while considering trust

as a transitive property makes sense, distrust cannot be

considered transitive.

Footnotes: Askville, http://askville.amazon.com/; Google Answers, http://answers.google.com/; Yedda, http://yedda.com/; Epinions, http://epinions.com/

Ziegler and Lausen [33] also study models for propagation

of trust. They present a taxonomy of trust metrics and dis-

cuss ways of incorporating information about distrust into

the rating scores.

Question/answering portals and forums.

The particular

context of question/answering communities we focus on in

this paper has been the object of some study in recent years.

According to Su et al. [31], the quality of answers in ques-

tion/answering portals is good on average, but the quality of

speciﬁc answers varies signiﬁcantly. In particular, in a study

of the answers to a set of questions in Yahoo! Answers, the

authors found that the fraction of correct answers to speciﬁc

questions asked by the authors of the study varied from 17%

to 45%. The fraction of questions in their sample with at

least one good answer was much higher, varying from 65%

to 90%, meaning that a method for finding high-quality answers

can have a significant impact on the user’s satisfaction

with the system.

Jeon et al. [17] extracted a set of features from a sample

of answers in Naver, a Korean question/answering portal

similar to Yahoo! Answers. They built a model for answer

quality based on features derived from the particular answer

being analyzed, such as answer length, number of points

received, etc., as well as user features, such as fraction of best

answers, number of answers given, etc. Our work expands

on this by exploring a substantially larger range of features

including structural, textual, and community features,

and by identifying quality of questions in addition to answer

quality.

Expert nding.

Zhang et al. [32] analyze data from an on-

line forum, seeking to identify users with high expertise.

They study the user answers graph in which there is a link

between users u and v if u answers a question by v, ap-

plying both ExpertiseRank and HITS to identify users with

high expertise. Their results show high correlation between

link-based metrics and the answer quality. The authors also

develop synthetic models that capture some of the charac-

teristics of the interactions among users in their dataset.

Jurczyk and Agichtein [20] show an application of the

HITS algorithm [22] to a question/answering portal. The

HITS algorithm is run on the user-answer graph. The re-

sults demonstrate that HITS is a promising approach, as the

obtained authority score is better correlated with the number

of votes that the items receive than simply counting the

number of answers the answerer has given in the past.

Campbell et al. [8] computed the authority score of HITS

over the user-user graph in a network of e-mail exchanges,

showing that it is more correlated to quality than other sim-

pler metrics. Dom et al. [11] studied the performance of

several link-based algorithms to rank people by expertise on

a network of e-mail exchanges, testing on both real and syn-

thetic data, and showing that in real data ExpertiseRank

outperforms HITS.

Footnote: Naver, http://naver.com/

Text analysis for content quality.

Most work on estimating the quality of text has been in the field of Automated

Essay Grading (AES), where writings of students are graded

by machines on several aspects, including compositionality,

style, accuracy, and soundness. AES systems are typically

built as text classiﬁcation tools, and use a range of prop-

erties derived from the text as features. Some of the fea-

tures employed in systems are lexical, such as word length,

measures of vocabulary irregularity via repetitiveness [7] or

uncharacteristic co-occurrence [9], and measures of topical-

ity through word and phrase frequencies [28]. Other features

take into account usage of punctuation and detection of common

grammatical errors (such as subject-verb disagreements)

via predeﬁned templates [4, 24]. Most platforms are com-

mercial and do not disclose full details of their internal fea-

ture set; overall, AES systems have been shown to correlate

very well with human judgments [6, 24].

A diﬀerent area of study involving text quality is read-

ability; here, the diﬃculty of text is analyzed to determine

the minimal age group able to comprehend it. Several mea-

sures of text readability have been proposed, including the

Gunning-Fog Index [15], the Flesch-Kincaid Formula [21],

and SMOG Grading [23]. All measures combine the num-

ber of syllables or words in the text with the number of

sentences, the first being a crude approximation of the syntactic

complexity and the second of the semantic complexity.

Although simplistic and controversial, these methods

are widely used and provide a rough estimation of the difficulty

of text.
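As an illustration of how such a measure combines syllable, word, and sentence counts, the sketch below computes the Flesch-Kincaid grade level, 0.39 (words/sentences) + 11.8 (syllables/words) - 15.59. The syllable counter is a rough vowel-group heuristic and the sentence splitter is a simple regular expression; both are assumptions made for the example, not part of the formula.

```python
import re


def count_syllables(word):
    """Approximate syllables as runs of vowels (crude heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level from word, sentence and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


simple = "The cat sat. The dog ran."
dense = "Community question answering portals exhibit considerable heterogeneity."
grades = (flesch_kincaid_grade(simple), flesch_kincaid_grade(dense))
```

Short sentences of one-syllable words score far lower than a single long sentence of polysyllabic words, which is the behavior the readability measures rely on.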

Implicit feedback for ranking.

Implicit feedback from mil-

lions of web users has been shown to be a valuable source of

result quality and ranking information. In particular, clicks

on results and methods for interpreting the clicks have been

studied in references [1, 2, 18]. We apply the results on click

interpretation on web search results from these studies, as

a source of quality information in social media. As we will

show, content usage statistics are valuable, but require dif-

ferent interpretation from the web search domain.

3. CONTENT QUALITY ANALYSIS IN

SOCIAL MEDIA

We now focus on the task of ﬁnding high quality content,

and describe our overall approach to solving this problem.

Evaluation of content quality is an essential module for performing

more advanced information-retrieval tasks on the

question/answering system. For instance, a quality score

can be used as input to ranking algorithms. On a high level,

our approach is to exploit features of social media that are

intuitively correlated with quality, and then train a classi-

ﬁer to appropriately select and weight the features for each

speciﬁc type of item, task, and quality deﬁnition.

In this section we identify a set of features of social media

and interactions that can be applied to the task of content-

quality identiﬁcation. In particular, we model the intrinsic

content quality (Section 3.1), the interactions between con-

tent creators and users (Section 3.2), as well as the content

usage statistics (Section 3.3). All these feature types are

used as an input to a classiﬁer that can be tuned for the

quality deﬁnition for the particular media type (Section 3.4).

In the next section, we will expand and reﬁne the feature set

speciﬁcally to match our main application domain of com-

munity question/answering portals.

3.1 Intrinsic content quality

The intrinsic quality metrics (i.e., the quality of the con-

tent of each item) that we use in this research are mostly

text-related, given that the social media items we evaluate

are primarily textual in nature. For user-generated content

of other types (e.g., photos or bookmarks), intrinsic quality

may be modeled diﬀerently.

As a baseline, we use textual features only—with all word

n-grams up to length 5 that appear in the collection more

than 3 times used as features. This straightforward ap-

proach is the de-facto standard for text classiﬁcation tasks,

both for classifying the topic and for other facets (e.g., sen-

timent classiﬁcation [26]).
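A minimal sketch of this baseline feature extraction follows. It assumes documents are already tokenized, and it counts total occurrences across the collection (rather than the number of documents containing an n-gram), which is one possible reading of "appear in the collection more than 3 times". The function names are hypothetical.

```python
from collections import Counter


def word_ngrams(tokens, max_n=5):
    """Yield every word n-gram of length 1..max_n as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_ngram_vocabulary(documents, max_n=5, min_count=4):
    """Keep n-grams seen at least min_count times (i.e. more than 3)."""
    counts = Counter()
    for tokens in documents:
        counts.update(word_ngrams(tokens, max_n))
    return {gram for gram, count in counts.items() if count >= min_count}


def ngram_features(tokens, vocabulary, max_n=5):
    """Bag-of-n-grams counts restricted to the selected vocabulary."""
    return Counter(g for g in word_ngrams(tokens, max_n) if g in vocabulary)


docs = [["how", "to"]] * 4 + [["why", "not"]]
vocab = build_ngram_vocabulary(docs)
features = ngram_features(["how", "to", "why"], vocab)
```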

Additionally, we use a large number of semantic features,

organized as follows:

Punctuation and typos.

Poor quality text, and particu-

larly of the type found in online sources, is often marked with

low conformance to common writing practices. For example,

capitalization rules may be ignored; excessive punctuation—

particularly repeated ellipsis and question marks—may be

used, or spacing may be irregular. Several of our features

capture the visual quality of the text, attempting to model

these irregularities; among these are features measuring punc-

tuation, capitalization, and spacing density (percent of all

characters), as well as features measuring the character-level

entropy of the text. Misspellings and typos are a particular form

of low visual quality; additional features in our

set quantify the number of spelling mistakes, as well as the

number of out-of-vocabulary words.
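The density and entropy features described above might be computed along the following lines. This is a sketch under assumptions: the feature names are invented for illustration, punctuation is taken to be the ASCII set in Python's `string.punctuation`, and entropy is the Shannon entropy of the character distribution in bits.

```python
import math
import string
from collections import Counter


def visual_quality_features(text):
    """Character-level densities and entropy for a piece of text."""
    n = len(text)
    if n == 0:
        return {"punct_density": 0.0, "caps_density": 0.0,
                "space_density": 0.0, "char_entropy": 0.0}
    counts = Counter(text)
    # Shannon entropy (bits) of the empirical character distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "punct_density": sum(ch in string.punctuation for ch in text) / n,
        "caps_density": sum(ch.isupper() for ch in text) / n,
        "space_density": sum(ch.isspace() for ch in text) / n,
        "char_entropy": entropy,
    }


noisy = visual_quality_features("WHAT???!!!")
clean = visual_quality_features("What is the capital of Spain?")
```

The shouting example has a much higher punctuation and capitalization density than the well-formed question, which is the kind of irregularity these features are meant to expose.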

Syntactic and semantic complexity.

Advancing from the

punctuation level to more involved layers of the text, other

features in this subset quantify its syntactic and semantic

complexity. These include simple proxies for complexity

such as the average number of syllables per word or the

entropy of word lengths, as well as more intricate ones such

as the readability measures [15, 21, 23] mentioned in Sec-

tion 2.2.

Grammaticality.

Finally, to measure the grammatical qual-

ity of the text, we use several linguistically-oriented features.

We annotate the content with part-of-speech (POS) tags,

and use the tag n-grams (again, up to length 5) as features.

This allows us to capture, to some degree, the level of “cor-

rectness” of the grammar used.

Some part-of-speech sequences are indicative of how well

formed a question is: e.g., the sequence “when|how|why to (verb)”

(as in “how to identify...”) is typical of lower-quality questions,

whereas the sequence “when|how|why (verb) (personal

pronoun) (verb)” (as in “how do I remove...”) is more typical

of correctly formed content.

Additional features used to represent grammatical prop-

erties of the text are its formality score [16], and the distance

between its (trigram) language model and several given lan-

guage models, such as the Wikipedia language model or the

language model of the Yahoo! Answers corpus itself (the dis-

tance is measured with KL-divergence).
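To make the distance computation concrete, the sketch below measures KL-divergence between two smoothed language models. It is deliberately simplified: it uses unigram models with add-alpha smoothing over a shared vocabulary instead of the trigram models mentioned above, and all names and the smoothing constant are assumptions for illustration.

```python
import math
from collections import Counter


def smoothed_distribution(tokens, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram model; tokens must lie in vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in nats for distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


answer = "how do i remove a virus".split()
reference = "the virus is a program the user did not install".split()
vocab = set(answer) | set(reference)
p = smoothed_distribution(answer, vocab)
q = smoothed_distribution(reference, vocab)
distance = kl_divergence(p, q)
```

Smoothing guarantees that every vocabulary word has nonzero probability under both models, so the logarithm is always defined.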

To identify out-of-vocabulary words, we construct multiple

lists of the k most frequent words in Yahoo! Answers, with

several k values ranging between 50 and 5000. These lists are

then used to calculate a set of “out-of-vocabulary” features,

where each feature assumes the list of top-k words for some

k is the vocabulary. An example feature created this way is

“the fraction of words in an answer that do not appear in

the top-1000 words of the collection.”
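The out-of-vocabulary features can be sketched as follows. The function names, the tokenized inputs, and the particular k values are hypothetical; the text only states that several values of k between 50 and 5000 are used.

```python
from collections import Counter


def top_k_vocabulary(corpus_tokens, k):
    """The k most frequent words of the collection, as a set."""
    return {word for word, _ in Counter(corpus_tokens).most_common(k)}


def oov_fraction(tokens, vocabulary):
    """Fraction of tokens that do not appear in the given vocabulary."""
    if not tokens:
        return 0.0
    return sum(t not in vocabulary for t in tokens) / len(tokens)


def oov_features(tokens, corpus_tokens, ks=(50, 500, 5000)):
    """One out-of-vocabulary feature per vocabulary size k."""
    return {
        f"oov_top_{k}": oov_fraction(tokens, top_k_vocabulary(corpus_tokens, k))
        for k in ks
    }


corpus = "the the the cat cat dog".split()
answer = ["the", "dog", "zebra", "cat"]
fraction = oov_fraction(answer, top_k_vocabulary(corpus, 2))
```

With the top-2 vocabulary {the, cat}, half of the words in the example answer are out of vocabulary.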

3.2 User relationships

A signiﬁcant amount of quality information can be in-

ferred from the relationships between users and items. For

example, we could apply link-analysis algorithms to propagate

quality scores among the entities of the question/answer

system, using the intuition that “good” answerers

write “good” answers, or vote for other “good” answerers.

The main challenge we have to face is that our dataset,

viewed as a graph, often contains nodes of multiple types

(e.g., questions, answers, users), and edges represent a set

of interactions among the nodes with different semantics

(e.g., “answers”, “gives best answer”, “votes for”, “gives a

star to”).

These relationships are represented as edges in a graph,

with content items and users as nodes. The edges are typed,

i.e., labeled with the particular type of interaction (e.g.,

“User u answers question q”). Besides the user-item rela-

tionship graph, we also consider the user-user graph. This

is the graph G = (V, E) in which the set of vertices V is

composed of the set of users, and the set E represents im-

plicit relationships between users. For example, a user-user

relationship could be “User u has answered a question from

user v.”

The resulting user-user graph is extremely rich and het-

erogeneous, and is unlike traditional graphs studied in the

web link analysis setting. However, we believe that (in our

classification framework) traditional link analysis algorithms

may provide useful evidence for quality classification, tuned

for the particular domain. Hence, for each type of link we

performed a separate computation of each link-analysis algorithm.

We computed the hubs and authorities scores (as

in the HITS algorithm [22]), and the PageRank scores [25]. In

Section 4 we discuss the speciﬁc relationships and node types

developed for community question/answering.

3.3 Usage statistics

Readers of the content (who may or may not also be contributors)

provide valuable information about the items they

ﬁnd interesting. In particular, usage statistics such as the

number of clicks on the item and dwell time have been shown

to be useful in the context of identifying high quality web search

results, and are complementary to link-analysis based meth-

ods. Intuitively, usage statistics measures are useful for so-

cial media content, but require diﬀerent interpretation from

the previously studied settings.

For example, all items within a popular category such as

celebrity images or popular culture topics may receive orders

of magnitude more clicks than, for instance, science topics.

Nevertheless, when normalized by the item category, the de-

viation from expected number of clicks can be used to infer

quality directly, or can be incorporated into the classiﬁca-

tion framework. The speciﬁc usage statistics that we use are

described in Section 4.3.
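One way to normalize clicks by category is a per-category z-score, sketched below. The choice of z-score, the use of the population standard deviation, and the record layout (item id, category, clicks) are assumptions for illustration; the text does not specify the exact normalization.

```python
from collections import defaultdict
from statistics import mean, pstdev


def click_deviation(items):
    """Per-item deviation from the expected clicks of its category.

    items is a list of (item_id, category, clicks) tuples. The result maps
    item_id to a z-score computed within the item's own category.
    """
    by_category = defaultdict(list)
    for _, category, clicks in items:
        by_category[category].append(clicks)
    stats = {cat: (mean(vals), pstdev(vals)) for cat, vals in by_category.items()}
    deviations = {}
    for item_id, category, clicks in items:
        mu, sigma = stats[category]
        # A category with no spread carries no signal (assumption).
        deviations[item_id] = (clicks - mu) / sigma if sigma > 0 else 0.0
    return deviations


items = [
    ("c1", "celebrity", 1000), ("c2", "celebrity", 3000),
    ("s1", "science", 10), ("s2", "science", 30),
]
deviations = click_deviation(items)
```

After normalization, the more-clicked science item and the more-clicked celebrity item receive the same deviation even though their raw click counts differ by two orders of magnitude.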

3.4 Overall classification framework

We cast the problem of quality ranking as a binary classiﬁ-

cation problem, in which a system must learn automatically

to separate high-quality content from the rest.

We experimented with several classiﬁcation algorithms,

including those reported to achieve good performance with

text classiﬁcation tasks, such as support vector machines

and log-linear classiﬁers; the best performance among the

techniques we tested was obtained with stochastic gradient
