Reuters-21578 text categorization test collection
Distribution 1.0
README file (v 1.2)
26 September 1997
David D. Lewis
AT&T Labs - Research
lewis@research.att.com
I. Introduction
This README describes Distribution 1.0 of the Reuters-21578 text
categorization test collection, a resource for research in information
retrieval, machine learning, and other corpus-based research.
II. Copyright & Notification
The copyright for the text of newswire articles and Reuters
annotations in the Reuters-21578 collection resides with Reuters Ltd.
Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free
distribution of this data *for research purposes only*.
If you publish results based on this data set, please acknowledge
its use, refer to the data set by the name "Reuters-21578,
Distribution 1.0", and inform your readers of the current location of
the data set (see "Availability & Questions").
III. Availability & Questions
The Reuters-21578, Distribution 1.0 test collection is available
from David D. Lewis' professional home page, currently:
http://www.research.att.com/~lewis
Besides this README file, the collection consists of 22 data files, an
SGML DTD file describing the data file format, and six files
describing the categories used to index the data. (See Sections VI
and VII for more details.)  Some additional files, which are not part
of the collection but have been contributed by other researchers as
useful resources, are also included.  All files are available
uncompressed, and in addition a single gzipped Unix tar archive of the
entire distribution is available as reuters21578.tar.gz.
The text categorization mailing list, DDLBETA, is a good place to
send questions about this collection and other text categorization
issues. You may join the list by writing David Lewis at
lewis@research.att.com.
IV. History & Acknowledgements
The documents in the Reuters-21578 collection appeared on the
Reuters newswire in 1987. The documents were assembled and indexed
with categories by personnel from Reuters Ltd. (Sam Dobbins, Mike
Topliss, Steve Weinstein) and Carnegie Group, Inc. (Peggy Andersen,
Monica Cellio, Phil Hayes, Laura Knecht, Irene Nirenburg) in 1987.
In 1990, the documents were made available by Reuters and CGI for
research purposes to the Information Retrieval Laboratory (W. Bruce
Croft, Director) of the Computer and Information Science Department at
the University of Massachusetts at Amherst. Formatting of the
documents and production of associated data files was done in 1990 by
David D. Lewis and Stephen Harding at the Information Retrieval
Laboratory.
Further formatting and data file production was done in 1991 and 1992
by David D. Lewis and Peter Shoemaker at the Center for Information
and Language Studies, University of Chicago. This version of the data
was made available for anonymous FTP as "Reuters-22173, Distribution
1.0" in January 1993. From 1993 through 1996, Distribution 1.0 was
hosted at a succession of FTP sites maintained by the Center for
Intelligent Information Retrieval (W. Bruce Croft, Director) of the
Computer Science Department at the University of Massachusetts at
Amherst.
At the ACM SIGIR '96 conference in August 1996, a group of text
categorization researchers discussed how published results on
Reuters-22173 could be made more comparable across studies. It was
decided that a new version of the collection should be produced with less
ambiguous formatting, and including documentation carefully spelling
out standard methods of using the collection. The opportunity would
also be used to correct a variety of typographical and other errors in
the categorization and formatting of the collection.
Steve Finch and David D. Lewis did this cleanup of the collection
September through November of 1996, relying heavily on Finch's
SGML-tagged version of the collection from an earlier study. One
result of the re-examination of the collection was the removal of 595
documents which were exact duplicates (based on identity of timestamps
down to the second) of other documents in the collection. The new
collection therefore has only 21,578 documents, and thus is called the
Reuters-21578 collection. This README describes version 1.0 of this
new collection, which we refer to as "Reuters-21578, Distribution
1.0".
In preparing the collection and documentation we have benefited from
discussions with Eric Brown, William Cohen, Fred Damerau, Yoram
Singer, Amit Singhal, and Yiming Yang, among many others.
We thank all the people and organizations listed above for their
efforts and support, without which this collection would not exist.
A variety of other changes were also made in going from Reuters-22173
to Reuters-21578:
1. Documents were marked up with SGML tags, and a corresponding
SGML DTD was produced, so that the boundaries of important sections of
documents (e.g. category fields) are unambiguous.
2. The set of categories that are legal for each of the five
controlled vocabulary fields was specified. All category names not
legal for a field were corrected to a legal category, moved to their
appropriate field, or removed, as appropriate.
3. Documents were given new ID numbers, in chronological order, and
are collected 1000 to a file in order by ID (and therefore in order
chronologically).
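Because the documents are packed 1000 to a file in ID order, the data file holding any given document can be computed directly from its ID. A minimal sketch in Python (assuming the new IDs run from 1 to 21,578, as described above):

```python
def file_for_docid(newid):
    """Return the name of the data file containing the document with
    the given ID, under the 1000-documents-per-file packing scheme."""
    if not 1 <= newid <= 21578:
        raise ValueError("document ID out of range: %d" % newid)
    # IDs 1-1000 are in reut2-000.sgm, 1001-2000 in reut2-001.sgm, etc.
    return "reut2-%03d.sgm" % ((newid - 1) // 1000)
```

For example, document 21,578 (the last) falls in reut2-021.sgm, which holds only the final 578 documents.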
V. What is a Text Categorization Test Collection and Who Cares?
*Text categorization* is the task of deciding whether a piece of
text belongs to any of a set of prespecified categories. It is a
generic text processing task useful in indexing documents for later
retrieval, as a stage in natural language processing systems, for
content analysis, and in many other roles [LEWIS94d].
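Structurally, the task amounts to one yes/no decision per category, with a document allowed to receive any number of categories (including none). A toy sketch in Python; the category names and keyword rules below are invented for illustration, not taken from the collection, and real systems typically learn such decision functions from training data:

```python
# Toy multi-label categorizer: one independent binary test per category.
# These hand-written keyword rules are purely illustrative.
RULES = {
    "wheat": lambda text: "wheat" in text.lower(),
    "ship":  lambda text: "ship" in text.lower() or "tanker" in text.lower(),
    "earn":  lambda text: "net profit" in text.lower(),
}

def categorize(text):
    """Return the (possibly empty) set of categories assigned to text."""
    return {cat for cat, test in RULES.items() if test(text)}
```

Note that a single story may trigger several categories, or none at all, which is exactly the overlapping, nonexhaustive structure discussed below.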
The use of standard, widely distributed test collections has been a
considerable aid in the development of algorithms for the related task
of *text retrieval* (finding documents that satisfy a particular
user's information need, usually expressed in a textual request).
Text retrieval test collections have allowed the comparison of
algorithms developed by a variety of researchers around the world.
(For more on text retrieval test collections see SPARCKJONES76.)
Standard test collections have been lacking, however, for text
categorization. Few data sets have been used by more than one
researcher, making results hard to compare. The Reuters-22173 test
collection has been used in a number of published studies since it was
made available, and we believe that the Reuters-21578 collection will
be even more valuable.
The collection may also be of interest to researchers in machine
learning, as it provides a classification task with challenging
properties. There are multiple categories, the categories are
overlapping and nonexhaustive, and there are relationships among the
categories. There are interesting possibilities for the use of domain
knowledge. There are many possible feature sets that can be extracted
from the text, and most plausible feature/example matrices are large
and sparse. There is even some temporal structure to the data
[LEWIS94b], though problems with the indexing and the uneven
distribution of stories within the timespan covered may make this
collection a poor one to explore temporal issues.
VI. Formatting
The Reuters-21578 collection is distributed in 22 files. Each of
the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000
documents, while the last (reut2-021.sgm) contains 578 documents.
    The files are in SGML format.  Rather than going into the details
of the SGML language, we describe here in an informal way how the SGML
tags are used.
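As a sketch of how the files might be read, the following uses Python's lenient html.parser to count <REUTERS> document elements; this is an illustration only, not part of the distribution, and a strict SGML parser driven by the distributed DTD would be the rigorous alternative:

```python
from html.parser import HTMLParser

class ReutersCounter(HTMLParser):
    """Count <REUTERS> document elements in Reuters-21578 SGML text."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        if tag == "reuters":
            self.count += 1

def count_documents(sgml_text):
    parser = ReutersCounter()
    parser.feed(sgml_text)
    return parser.count

# Usage over the whole distribution (assuming the .sgm files are in
# the current directory):
#   total = sum(count_documents(open("reut2-%03d.sgm" % i).read())
#               for i in range(22))
```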