ABOUT ACM BOOKS
ACM Books is a new series of high-quality books for
the computer science community, published by ACM
in collaboration with Morgan & Claypool Publishers.
ACM Books publications are widely distributed in
both print and digital formats through booksellers
and to libraries (and library consortia) and individual ACM members via the ACM
Digital Library platform.
BOOKS.ACM.ORG • WWW.MORGANCLAYPOOL.COM
ACM | MORGAN & CLAYPOOL
Text Data Management and Analysis
A Practical Introduction to Information Retrieval and Text Mining
ChengXiang Zhai
Sean Massung
ISBN: 978-1-97000-116-7
Recent years have seen a dramatic growth of natural language text data, including web pages,
news articles, scientific literature, emails, enterprise documents, and social media such as
blog articles, forum posts, product reviews, and tweets. This has led to an increasing demand
for powerful software tools to help people manage and analyze vast amounts of text data
effectively and efficiently. Unlike data generated by a computer system or sensors, text data are
usually generated directly by humans and capture semantically rich content. As such, text
data are especially valuable for discovering knowledge about human opinions and preferences,
in addition to many other kinds of knowledge that we encode in text. In contrast to structured
data, which conform to well-defined schemas (and are thus relatively easy for computers to
handle), text has less explicit structure, so a computer must process it to understand
the content it encodes. Current natural language processing technology has
not yet reached the point where a computer can precisely understand natural language text,
but a wide range of statistical and heuristic approaches to the management and analysis of text
data have been developed over the past few decades. These approaches are usually very robust and
can be applied to analyze and manage text data in any natural language and about any topic.
This book provides a systematic introduction to many of these approaches, with an emphasis
on covering the most useful knowledge and skills required to build a variety of practically
useful text information systems. Because humans can understand natural languages
far better than computers can, effective involvement of humans in a text information system
is generally needed, and text information systems often serve as intelligent assistants for
humans. Depending on how a text information system collaborates with humans, we distinguish
two kinds of text information systems. The first is information retrieval systems, which include
search engines and recommender systems; they assist users in finding, from a large collection
of text data, the most relevant text data that are actually needed for solving a specific
application problem, thus effectively turning big raw text data into much smaller relevant text data
that can be more easily processed by humans. The second is text mining application systems;
they assist users in analyzing patterns in text data to extract and discover actionable
knowledge directly useful for task completion or decision making, thus providing more
direct task support for users.