paoding_analysis.rar_PaodingAnalysis_lucenepaoding

915 files in total

class: 533

java: 337

html: 24


Uploaded 2022-09-14 16:23:51 · 2.77MB · RAR

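PaodingAnalysis and Lucene: Chinese Indexing Explained

Efficient information retrieval is central to handling large volumes of data, for companies and individuals alike. Lucene, a full-text search library, is widely used in information retrieval systems. For Chinese text, PaodingAnalysis (also known as the Paoding analyzer) supplies Lucene with Chinese word segmentation, which makes Chinese retrieval more accurate and efficient. This description covers the main points of using PaodingAnalysis together with Lucene.

First, the basics of Lucene. Lucene is an open-source full-text retrieval library that covers the whole search pipeline: index creation, document storage, query parsing, and ranking of search results. At its core, it converts text into an inverted index, which maps each term to the documents that contain it, so that documents matching a keyword can be located quickly. In English, word boundaries are explicit, so Lucene's default tokenization usually works well. Chinese text has no explicit word boundaries, which is why a dedicated segmenter such as Paoding is needed.

PaodingAnalysis is a Chinese analyzer designed for Lucene, with an emphasis on speed and segmentation accuracy. It is described as supporting several segmentation modes (precise mode, full mode, and search-engine mode) to suit different scenarios. Precise mode suits cases that demand accurate results, such as keyword analysis for a search engine. Full mode emits as many candidate words as possible and suits large bodies of text such as news articles. Search-engine mode balances accuracy against speed and suits real-time queries.

To integrate PaodingAnalysis with Lucene, a developer first adds the PaodingAnalysis dependency and then selects PaodingAnalyzer as the Analyzer when configuring Lucene. From then on, whenever Chinese text is indexed or searched, the Paoding analyzer segments it automatically and produces tokens suitable for search queries. In this way PaodingAnalysis improves Lucene's handling of Chinese and makes search results more accurate.

In practice, the "paoding_analysis.rar" archive most likely contains the resources and configuration files needed for this, such as segmentation dictionaries, sample code, and documentation. The "lucene paoding paodi" tags in the file name suggest that it is an example or library for building Chinese indexes with Lucene and the Paoding analyzer. Developers can unpack the archive and use the sample code as a guide to integrating PaodingAnalysis into their own projects.

In short, PaodingAnalysis adds Chinese word segmentation to Lucene and so improves Lucene-based Chinese retrieval systems. Developers who understand how the two tools work together can build more capable and efficient Chinese search applications.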
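The two ideas above, dictionary-based segmentation and an inverted index, can be sketched in a few lines of plain Java. This is only a toy under stated assumptions: the class name `ToyChineseIndex`, its methods, and the five-word dictionary are all hypothetical, the segmenter uses simple forward maximum matching (not necessarily what Paoding does), and the index is an in-memory map rather than Lucene's on-disk format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy illustration only: NOT Paoding's algorithm and NOT Lucene's index format. */
final class ToyChineseIndex {
    // Hypothetical mini-dictionary; a real analyzer ships large dictionary files.
    private final Set<String> dictionary;
    private final int maxWordLength;
    // Inverted index: term -> sorted ids of the documents that contain it.
    private final Map<String, TreeSet<Integer>> postings = new LinkedHashMap<>();

    ToyChineseIndex(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    /** Forward maximum matching: take the longest dictionary word at each position, else one character. */
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    /** Segments the text and records the document id under every resulting term. */
    void addDocument(int docId, String text) {
        for (String term : segment(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Looks up a single term; returns an empty set when the term was never indexed. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("中文", "分词", "搜索", "引擎", "搜索引擎"));
        ToyChineseIndex index = new ToyChineseIndex(words);
        index.addDocument(1, "中文分词");
        index.addDocument(2, "中文搜索引擎");
        System.out.println(index.segment("中文搜索引擎"));
        System.out.println(index.search("中文"));
        System.out.println(index.search("搜索"));
    }
}
```

Because this sketch always prefers the longest match, document 2 is indexed under "搜索引擎" and a lookup for "搜索" alone finds nothing. That is the kind of trade-off that different segmentation modes are meant to address: emitting more, shorter words improves recall at the cost of a larger index.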


paoding_analysis.rar_PaodingAnalysis_lucene paoding_paodi (915 files)

IndexWriter.class 65KB

QueryParser.class 29KB

DocumentsWriter.class 28KB

QueryParserKey.class 27KB

SegmentReader.class 24KB

SegmentMerger.class 18KB

CheckIndex.class 17KB

QueryParserTokenManager.class 16KB

IndexReader.class 14KB

FSDirectory.class 12KB

FieldsReader.class 12KB

MultiSegmentReader.class 12KB

ParallelReader.class 11KB

SegmentInfo.class 11KB

TermsHashPerField.class 11KB

TermVectorsReader.class 11KB

IndexFileDeleter.class 11KB

FreqProxTermsWriter.class 10KB

SegmentInfos.class 10KB

Token.class 10KB

OpenBitSet.class 10KB

BitUtil.class 10KB

StandardTokenizerImpl.class 10KB

DirectoryIndexReader.class 10KB

FieldSortedHitQueue.class 9KB

MultiReader.class 8KB

TermVectorsTermsWriter.class 8KB

DocFieldProcessorPerThread.class 8KB

PorterStemmer.class 8KB

LogMergePolicy.class 8KB

TermInfosReader.class 7KB

MultiSearcher.class 7KB

TermVectorsTermsWriterPerField.class 7KB

FieldInfos.class 7KB

TermsHash.class 7KB

BooleanScorer2.class 7KB

FieldsWriter.class 7KB

IndexModifier.class 7KB

ConcurrentMergeScheduler.class 7KB

NearSpansOrdered.class 7KB

Search.class 6KB

BooleanQuery.class 6KB

TermInfosWriter.class 6KB

NearSpansUnordered.class 6KB

CustomScoreQuery.class 6KB

StoredFieldsWriter.class 6KB

TermVectorsWriter.class 6KB

Field.class 6KB

AbstractField.class 6KB

MultiPhraseQuery$MultiPhraseWeight.class 6KB

MultiFieldQueryParser.class 6KB

DocInverterPerField.class 6KB

NormsWriter.class 6KB

PhraseQuery$PhraseWeight.class 6KB

RangeFilter.class 5KB

MultiPhraseQuery.class 5KB

Hits.class 5KB

BooleanQuery$BooleanWeight.class 5KB

CompoundFileWriter.class 5KB

IndexSearcher.class 5KB

BitVector.class 5KB

SegmentTermDocs.class 5KB

TermsHashPerThread.class 5KB

SpanWeight.class 5KB

FieldCacheImpl.class 5KB

RAMDirectory.class 5KB

RangeQuery.class 5KB

DateTools.class 5KB

Document.class 5KB

PayloadSpanUtil.class 5KB

FilterIndexReader.class 5KB

SpanNearQuery.class 5KB

BooleanScorer.class 5KB

PhraseQuery.class 5KB

ParallelMultiSearcher.class 5KB

StandardAnalyzer.class 5KB

Searcher.class 5KB

TermQuery$TermWeight.class 5KB

UnicodeUtil.class 5KB

CharArraySet.class 5KB

FuzzyQuery.class 5KB

SegmentInfos$FindSegmentsFile.class 5KB

DocFieldConsumers.class 5KB

CompoundFileReader.class 5KB

Query.class 4KB

FuzzyTermEnum.class 4KB

FreqProxTermsWriterPerField.class 4KB

SegmentTermEnum.class 4KB

SpanOrQuery.class 4KB

FieldsReader$LazyField.class 4KB

NativeFSLock.class 4KB

DisjunctionSumScorer.class 4KB

OpenBitSetIterator.class 4KB

ItemLocationIndex.class 4KB

TermScorer.class 4KB

QueryTermVector.class 4KB

BoostingTermQuery$BoostingTermWeight$BoostingSpanScorer.class 4KB

DisjunctionMaxQuery$DisjunctionMaxWeight.class 4KB

CustomScoreQuery$CustomScorer.class 4KB

LockStressTest.class 4KB


package org.apache.lucene.index;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Lock;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.AlreadyClosedException;
import org.apache.lucene.util.BitVector;
import org.apache.lucene.util.Constants;

import java.io.File;
import java.io.IOException;
import java.io.PrintStream;
import java.util.List;
import java.util.Collection;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Set;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Iterator;

/**
  An <code>IndexWriter</code> creates and maintains an index.

  The <code>create</code> argument to the
  <a href="#IndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.analysis.Analyzer, boolean)">constructor</a>
  determines whether a new index is created, or whether an existing index is
  opened. Note that you can open an index with <code>create=true</code>
  even while readers are using the index. The old readers will continue to
  search the "point in time" snapshot they had opened, and won't see the
  newly created index until they re-open. There are also
  <a href="#IndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.analysis.Analyzer)">constructors</a>
  with no <code>create</code> argument which will create a new index if
  there is not already an index at the provided path and otherwise open the
  existing index.

  In either case, documents are added with
  <a href="#addDocument(org.apache.lucene.document.Document)">addDocument</a>
  and removed with
  <a href="#deleteDocuments(org.apache.lucene.index.Term)">deleteDocuments(Term)</a>
  or
  <a href="#deleteDocuments(org.apache.lucene.search.Query)">deleteDocuments(Query)</a>.
  A document can be updated with
  <a href="#updateDocument(org.apache.lucene.index.Term, org.apache.lucene.document.Document)">updateDocument</a>
  (which just deletes and then adds the entire document). When finished
  adding, deleting and updating documents,
  <a href="#close()">close</a> should be called.

  <a name="flush"></a>
  These changes are buffered in memory and periodically flushed to the
  {@link Directory} (during the above method calls). A flush is triggered
  when there are enough buffered deletes (see
  {@link #setMaxBufferedDeleteTerms}) or enough added documents since the
  last flush, whichever is sooner. For the added documents, flushing is
  triggered either by RAM usage of the documents (see
  {@link #setRAMBufferSizeMB}) or the number of added documents. The default
  is to flush when RAM usage hits 16 MB. For best indexing speed you should
  flush by RAM usage with a large RAM buffer. Note that flushing just moves
  the internal buffered state in IndexWriter into the index, but these
  changes are not visible to IndexReader until either {@link #commit()} or
  {@link #close} is called. A flush may also trigger one or more segment
  merges which by default run with a background thread so as not to block
  the addDocument calls (see <a href="#mergePolicy">below</a> for changing
  the {@link MergeScheduler}).

  <a name="autoCommit"></a>
  The optional <code>autoCommit</code> argument to the
  <a href="#IndexWriter(org.apache.lucene.store.Directory, boolean, org.apache.lucene.analysis.Analyzer)">constructors</a>
  controls visibility of the changes to {@link IndexReader} instances
  reading the same index. When this is <code>false</code>, changes are not
  visible until {@link #close()} or {@link #commit()} is called. Note that
  changes will still be flushed to the
  {@link org.apache.lucene.store.Directory} as new files, but are not
  committed (no new <code>segments_N</code> file is written referencing the
  new files, nor are the files sync'd to stable storage) until
  {@link #close()} or {@link #commit()} is called. If something goes
  terribly wrong (for example the JVM crashes), then the index will reflect
  none of the changes made since the last commit, or the starting state if
  commit was not called. You can also call {@link #rollback}, which closes
  the writer without committing any changes, and removes any index files
  that had been flushed but are now unreferenced. This mode is useful for
  preventing readers from refreshing at a bad time (for example after
  you've done all your deletes but before you've done your adds). It can
  also be used to implement simple single-writer transactional semantics
  ("all or none"). You can do a two-phase commit by calling
  {@link #prepareCommit()} followed by {@link #commit()}. This is necessary
  when Lucene is working with an external resource (for example, a
  database) and both must either commit or rollback the transaction.

  When <code>autoCommit</code> is <code>true</code> then the writer will
  periodically commit on its own. [Deprecated: Note that in 3.0,
  IndexWriter will no longer accept autoCommit=true (it will be hardwired to
  false). You can always call {@link #commit()} yourself when needed].
  There is no guarantee when exactly an auto commit will occur (it used to
  be after every flush, but it is now after every completed merge, as of
  2.4). If you want to force a commit, call {@link #commit()}, or, close
  the writer. Once a commit has finished, newly opened {@link IndexReader}
  instances will see the changes to the index as of that commit. When
  running in this mode, be careful not to refresh your readers while
  optimize or segment merges are taking place as this can tie up
  substantial disk space.

  Regardless of <code>autoCommit</code>, an {@link IndexReader} or
  {@link org.apache.lucene.search.IndexSearcher} will only see the index as
  of the "point in time" that it was opened. Any changes committed to the
  index after the reader was opened are not visible until the reader is
  re-opened.

  If an index will not have more documents added for a while and optimal
  search performance is desired, then either the full
  <a href="#optimize()">optimize</a> method or partial
  {@link #optimize(int)} method should be called before the index is
  closed.

  Opening an <code>IndexWriter</code> creates a lock file for the directory
  in use. Trying to open another <code>IndexWriter</code> on the same
  directory will lead to a {@link LockObtainFailedException}. The
  {@link LockObtainFailedException} is also thrown if an IndexReader on the
  same directory is used to delete documents from the index.

  <a name="deletionPolicy"></a>
  Expert: <code>IndexWriter</code> allows an optional
  {@link IndexDeletionPolicy} implementation to be specified. You can use
  this to control when prior commits are deleted from the index. The
  default policy is {@link KeepOnlyLastCommitDeletionPolicy} which removes
  all prior commits as soon as
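The source preview is cut off at this point. The commit, rollback, and "point in time" rules described in the Javadoc above can be mimicked with a small stand-alone model. This is a hedged sketch, not Lucene code: `ToyCommitLog` and its methods are hypothetical names, documents are plain strings, and unlike the real `rollback` this toy version does not close the writer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of buffered changes, commit, rollback and point-in-time readers. Not Lucene code. */
final class ToyCommitLog {
    // Last committed state; each commit replaces it with a new immutable snapshot.
    private List<String> committed = Collections.emptyList();
    // Buffered additions that no reader can see yet.
    private final List<String> pending = new ArrayList<>();

    /** Buffers a document; it stays invisible to readers until commit(). */
    void addDocument(String doc) {
        pending.add(doc);
    }

    /** Publishes all pending additions as a new snapshot. */
    void commit() {
        List<String> next = new ArrayList<>(committed);
        next.addAll(pending);
        committed = Collections.unmodifiableList(next);
        pending.clear();
    }

    /** Discards everything added since the last commit (the toy writer stays open). */
    void rollback() {
        pending.clear();
    }

    /** A "reader" is simply the snapshot that was current when it was opened. */
    List<String> openReader() {
        return committed;
    }

    public static void main(String[] args) {
        ToyCommitLog log = new ToyCommitLog();
        log.addDocument("doc-1");
        System.out.println("before commit: " + log.openReader());

        log.commit();
        List<String> oldReader = log.openReader();

        log.addDocument("doc-2");
        log.commit();
        System.out.println("old reader:    " + oldReader);
        System.out.println("new reader:    " + log.openReader());

        log.addDocument("doc-3");
        log.rollback();
        log.commit();
        System.out.println("after rollback: " + log.openReader());
    }
}
```

The old reader keeps seeing only the first document even after the second commit, which matches the Javadoc's statement that a reader sees the index as of the moment it was opened and must be re-opened to pick up later commits.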
