Inductive Learning Algorithms and Representations for
Text Categorization
Susan Dumais
Microsoft Research
One Microsoft Way
Redmond, WA 98052
sdumais@microsoft.com
John Platt
Microsoft Research
One Microsoft Way
Redmond, WA 98052
jplatt@microsoft.com
Mehran Sahami
Computer Science Department
Stanford University
Stanford, CA 94305-9010
sahami@cs.stanford.edu
David Heckerman
Microsoft Research
One Microsoft Way
Redmond, WA 98052
heckerma@microsoft.com
1. ABSTRACT
Text categorization – the assignment of natural
language texts to one or more predefined
categories based on their content – is an
important component in many information
organization and management tasks. We
compare the effectiveness of five different
automatic learning algorithms for text
categorization in terms of learning speed, real-
time classification speed, and classification
accuracy. We also examine training set size
and alternative document representations.
Very accurate text classifiers can be learned
automatically from training examples. Linear
Support Vector Machines (SVMs) are
particularly promising because they are very
accurate, quick to train, and quick to evaluate.
1.1 Keywords
Text categorization, classification, support vector machines,
machine learning, information management.
2. INTRODUCTION
As the volume of information available on the Internet and
corporate intranets continues to increase, there is growing
interest in helping people better find, filter, and manage
these resources. Text categorization – the assignment of
natural language texts to one or more predefined categories
based on their content – is an important component in many
information organization and management tasks. Its most
widespread application to date has been for assigning
subject categories to documents to support text retrieval,
routing and filtering.
Automatic text categorization can play an important role in
a wide variety of more flexible, dynamic and personalized
information management tasks as well: real-time sorting of
email or files into folder hierarchies; topic identification to
support topic-specific processing operations; structured
search and/or browsing; or finding documents that match
long-term standing interests or more dynamic task-based
interests. Classification technologies should be able to
support category structures that are very general, consistent
across individuals, and relatively static (e.g., Dewey
Decimal or Library of Congress classification systems,
Medical Subject Headings (MeSH), or Yahoo!’s topic
hierarchy), as well as those that are more dynamic and
customized to individual interests or tasks (e.g., email about
the CIKM conference).
In many contexts (Dewey, MeSH, Yahoo!, CyberPatrol),
trained professionals are employed to categorize new items.
This process is very time-consuming and costly, which
limits its applicability. Consequently, there is increased
interest in developing technologies for automatic text
categorization. Rule-based approaches similar to those
used in expert systems are common (e.g., Hayes and