Reuters-21578 text categorization test collection
Distribution 1.0
README file (v 1.2)
26 September 1997
David D. Lewis
AT&T Labs - Research
lewis@research.att.com
I. Introduction
This README describes Distribution 1.0 of the Reuters-21578 text
categorization test collection, a resource for research in information
retrieval, machine learning, and other corpus-based research.
II. Copyright & Notification
The copyright for the text of newswire articles and Reuters
annotations in the Reuters-21578 collection resides with Reuters Ltd.
Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free
distribution of this data *for research purposes only*.
If you publish results based on this data set, please acknowledge
its use, refer to the data set by the name "Reuters-21578,
Distribution 1.0", and inform your readers of the current location of
the data set (see "Availability & Questions").
III. Availability & Questions
The Reuters-21578, Distribution 1.0 test collection is available
from David D. Lewis' professional home page, currently:
http://www.research.att.com/~lewis
Besides this README file, the collection consists of 22 data files, an
SGML DTD file describing the data file format, and six files
describing the categories used to index the data. (See Sections VI
and VII for more details.)  Some additional files, which are not part
of the collection but have been contributed by other researchers as
useful resources, are also included.  All files are available
uncompressed, and in addition a single gzipped Unix tar archive of the
entire distribution is available as reuters21578.tar.gz.
The text categorization mailing list, DDLBETA, is a good place to
send questions about this collection and other text categorization
issues. You may join the list by writing David Lewis at
lewis@research.att.com.
IV. History & Acknowledgements
The documents in the Reuters-21578 collection appeared on the
Reuters newswire in 1987. The documents were assembled and indexed
with categories by personnel from Reuters Ltd. (Sam Dobbins, Mike
Topliss, Steve Weinstein) and Carnegie Group, Inc. (Peggy Andersen,
Monica Cellio, Phil Hayes, Laura Knecht, Irene Nirenburg) in 1987.
In 1990, the documents were made available by Reuters and CGI for
research purposes to the Information Retrieval Laboratory (W. Bruce
Croft, Director) of the Computer and Information Science Department at
the University of Massachusetts at Amherst. Formatting of the
documents and production of associated data files was done in 1990 by
David D. Lewis and Stephen Harding at the Information Retrieval
Laboratory.
Further formatting and data file production was done in 1991 and 1992
by David D. Lewis and Peter Shoemaker at the Center for Information
and Language Studies, University of Chicago. This version of the data
was made available for anonymous FTP as "Reuters-22173, Distribution
1.0" in January 1993. From 1993 through 1996, Distribution 1.0 was
hosted at a succession of FTP sites maintained by the Center for
Intelligent Information Retrieval (W. Bruce Croft, Director) of the
Computer Science Department at the University of Massachusetts at
Amherst.
At the ACM SIGIR '96 conference in August 1996, a group of text
categorization researchers discussed how published results on
Reuters-22173 could be made more comparable across studies. It was
decided that a new version of the collection should be produced with less
ambiguous formatting, and including documentation carefully spelling
out standard methods of using the collection. The opportunity would
also be used to correct a variety of typographical and other errors in
the categorization and formatting of the collection.
Steve Finch and David D. Lewis did this cleanup of the collection
September through November of 1996, relying heavily on Finch's
SGML-tagged version of the collection from an earlier study. One
result of the re-examination of the collection was the removal of 595
documents which were exact duplicates (based on identity of timestamps
down to the second) of other documents in the collection. The new
collection therefore has only 21,578 documents, and thus is called the
Reuters-21578 collection. This README describes version 1.0 of this
new collection, which we refer to as "Reuters-21578, Distribution
1.0".
In preparing the collection and documentation we have benefited from
discussions with Eric Brown, William Cohen, Fred Damerau, Yoram
Singer, Amit Singhal, and Yiming Yang, among many others.
We thank all the people and organizations listed above for their
efforts and support, without which this collection would not exist.
A variety of other changes were also made in going from Reuters-22173
to Reuters-21578:
1. Documents were marked up with SGML tags, and a corresponding
SGML DTD was produced, so that the boundaries of important sections of
documents (e.g. category fields) are unambiguous.
2. The set of categories that are legal for each of the five
controlled vocabulary fields was specified. All category names not
legal for a field were corrected to a legal category, moved to their
appropriate field, or removed, as appropriate.
3. Documents were given new ID numbers, in chronological order, and
are collected 1000 to a file in order by ID (and therefore in order
chronologically).
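Because the documents are packed 1000 to a file in ID order, the data file holding any given document can be computed directly from its ID. A minimal sketch in Python (assuming the new IDs run from 1 to 21,578, as described above):

```python
def file_for_docid(newid):
    """Return the name of the data file containing the document with
    the given ID, under the 1000-documents-per-file packing scheme."""
    if not 1 <= newid <= 21578:
        raise ValueError("document ID out of range: %d" % newid)
    # IDs 1-1000 are in reut2-000.sgm, 1001-2000 in reut2-001.sgm, etc.
    return "reut2-%03d.sgm" % ((newid - 1) // 1000)
```

For example, document 21,578 (the last) falls in reut2-021.sgm, which holds only the final 578 documents.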
V. What is a Text Categorization Test Collection and Who Cares?
*Text categorization* is the task of deciding whether a piece of
text belongs to any of a set of prespecified categories. It is a
generic text processing task useful in indexing documents for later
retrieval, as a stage in natural language processing systems, for
content analysis, and in many other roles [LEWIS94d].
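Structurally, the task amounts to one yes/no decision per category, with a document allowed to receive any number of categories (including none). A toy sketch in Python; the category names and keyword rules below are invented for illustration, not taken from the collection, and real systems typically learn such decision functions from training data:

```python
# Toy multi-label categorizer: one independent binary test per category.
# These hand-written keyword rules are purely illustrative.
RULES = {
    "wheat": lambda text: "wheat" in text.lower(),
    "ship":  lambda text: "ship" in text.lower() or "tanker" in text.lower(),
    "earn":  lambda text: "net profit" in text.lower(),
}

def categorize(text):
    """Return the (possibly empty) set of categories assigned to text."""
    return {cat for cat, test in RULES.items() if test(text)}
```

Note that a single story may trigger several categories, or none at all, which is exactly the overlapping, nonexhaustive structure discussed below.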
The use of standard, widely distributed test collections has been a
considerable aid in the development of algorithms for the related task
of *text retrieval* (finding documents that satisfy a particular
user's information need, usually expressed in a textual request).
Text retrieval test collections have allowed the comparison of
algorithms developed by a variety of researchers around the world.
(For more on text retrieval test collections see SPARCKJONES76.)
Standard test collections have been lacking, however, for text
categorization. Few data sets have been used by more than one
researcher, making results hard to compare. The Reuters-22173 test
collection has been used in a number of published studies since it was
made available, and we believe that the Reuters-21578 collection will
be even more valuable.
The collection may also be of interest to researchers in machine
learning, as it provides a classification task with challenging
properties. There are multiple categories, the categories are
overlapping and nonexhaustive, and there are relationships among the
categories. There are interesting possibilities for the use of domain
knowledge. There are many possible feature sets that can be extracted
from the text, and most plausible feature/example matrices are large
and sparse. There is even some temporal structure to the data
[LEWIS94b], though problems with the indexing and the uneven
distribution of stories within the timespan covered may make this
collection a poor one to explore temporal issues.
VI. Formatting
The Reuters-21578 collection is distributed in 22 files. Each of
the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000
documents, while the last (reut2-021.sgm) contains 578 documents.
    The files are in SGML format.  Rather than going into the details
of the SGML language, we describe here in an informal way how the SGML
tags are used.
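As a sketch of how the files might be read, the following uses Python's lenient html.parser to count <REUTERS> document elements; this is an illustration only, not part of the distribution, and a strict SGML parser driven by the distributed DTD would be the rigorous alternative:

```python
from html.parser import HTMLParser

class ReutersCounter(HTMLParser):
    """Count <REUTERS> document elements in Reuters-21578 SGML text."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        if tag == "reuters":
            self.count += 1

def count_documents(sgml_text):
    parser = ReutersCounter()
    parser.feed(sgml_text)
    return parser.count

# Usage over the whole distribution (assuming the .sgm files are in
# the current directory):
#   total = sum(count_documents(open("reut2-%03d.sgm" % i).read())
#               for i in range(22))
```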