Finding High-Quality Content in Social Media
Eugene Agichtein
Emory University
Atlanta, USA
eugene@mathcs.emory.edu
Carlos Castillo
Yahoo! Research
Barcelona, Spain
chato@yahoo-inc.com
Debora Donato
Yahoo! Research
Barcelona, Spain
debora@yahoo-inc.com
Aristides Gionis
Yahoo! Research
Barcelona, Spain
gionis@yahoo-inc.com
Gilad Mishne
Search and Advertising
Sciences, Yahoo!
gilad@yahoo-inc.com
ABSTRACT
The quality of user-generated content varies drastically from
excellent to abuse and spam. As the availability of such con-
tent increases, the task of identifying high-quality content
in sites based on user contributions—social media sites—
becomes increasingly important. Social media in general
exhibit a rich variety of information sources: in addition to
the content itself, there is a wide array of non-content infor-
mation available, such as links between items and explicit
quality ratings from members of the community. In this pa-
per we investigate methods for exploiting such community
feedback to automatically identify high-quality content. As
a test case, we focus on Yahoo! Answers, a large community
question/answering portal that is particularly rich in the
amount and types of content and social interactions available
in it. We introduce a general classification framework
for combining the evidence from different sources of information;
the framework can be tuned automatically for a given social
media type and quality definition. In particular, for the
community question/answering domain, we show that our
system is able to separate high-quality items from the rest
with an accuracy close to that of humans.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Con-
tent Analysis and Indexing – indexing methods, linguistic
processing; H.3.3 Information Search and Retrieval – infor-
mation filtering, search process.
General Terms
Algorithms, Design, Experimentation.
Keywords
Social media, Community Question Answering, User Inter-
actions.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
WSDM'08,
February 11-12, 2008, Palo Alto, California, USA.
Copyright 2008 ACM 978-1-59593-927-9/08/0002 ...$5.00.
1. INTRODUCTION
Recent years have seen a transformation in the type of
content available on the web. During the first decade of the
web’s prominence—from the early 1990s onwards—most on-
line content resembled traditional published material: the
majority of web users were consumers of content, created
by a relatively small number of publishers. From the early
2000s, user-generated content has become increasingly pop-
ular on the web: more and more users participate in con-
tent creation, rather than just consumption. Popular user-
generated content (or social media) domains include blogs
and web forums, social bookmarking sites, photo and video
sharing communities, as well as social networking platforms
such as Facebook and MySpace, which offer a combination
of all of these with an emphasis on the relationships among
the users of the community.
Community-driven question/answering portals are a particular
form of user-generated content that has gained a large
audience in recent years. These portals, in which users answer
questions posed by other users, provide an alternative
channel for obtaining information on the web: rather than
browsing results of search engines, users present detailed
information needs and get direct responses authored by humans.
In some markets, this information-seeking behavior
dominates traditional web search [29].
An important difference between user-generated content
and traditional content that is particularly significant for
knowledge-based media such as question/answering portals
is the variance in the quality of the content. As Ander-
son [3] describes, in traditional publishing—mediated by a
publisher—the typical range of quality is substantially nar-
rower than in niche, unmediated markets. The main chal-
lenge posed by content in social media sites is the fact that
the distribution of quality has high variance: from very
high-quality items to low-quality, sometimes abusive con-
tent. This makes the tasks of filtering and ranking in such
systems more complex than in other domains. However, for
information-retrieval tasks, social media systems present in-
herent advantages over traditional collections of documents:
their rich structure offers more available data than in other
domains. In addition to document content and link struc-
ture, social media exhibit a wide variety of user-to-document
relation types, and user-to-user interactions.
In this paper we address the task of identifying high-
quality content in community-driven question/answering sites,
exploring the benefits of having additional sources of infor-