RETRIEVE: A Text Reuse Software Package
--------
RETRIEVE is designed to make three text reuse retrieval algorithmic paradigms accessible to a broader audience.
The focus of RETRIEVE is small and medium scale retrieval problems (collections on the order of **tens of thousands** to **hundreds of thousands** of sentences). However, the software tries to optimize memory usage when possible, and with some manual tuning (and waiting time) it can be run on larger problems.
As of today, RETRIEVE implements the retrieval of text reuse at the word level. While it would be easy to extend it to operate on subword strings, this has been left out of the current implementation (PRs welcome).
Typically, text reuse retrieval software improves results by using lemmatized input. RETRIEVE tries to do one specialized thing well and therefore does not include lemmatization in the pipeline. This means that you will have to provide lemmatized input if you want to benefit from lemmatization. If you don't have a lemmatizer available for your language of choice, I recommend [PIE](https://github.com/emanjavacas/pie/), which you can use to train your own lemmatizer.
# Installation
## Installing from PyPI
RETRIEVE can be installed from PyPI using `pip`. Just fire up a command prompt and type:
```
pip install text-reuse-retrieve
```
## Installation from source
RETRIEVE can also be installed by first downloading the repository, installing the dependencies and issuing the `python setup.py install` command within the top directory.
Dependencies are kept in the `requirements.txt` file. To install them, use:
```
pip install -r requirements.txt
```
# Workflow
The workflow consists of the following steps:
- Data Preparation: gathering sentences on which to carry out the text reuse search
- Text Preprocessing: processing input documents so as to facilitate the subsequent search
- Search: running search algorithms on the input collections
## Data preparation
RETRIEVE doesn't offer many tools to aid the data preparation process, just functionality to load the resources and operate on them. The most important resource is a lemmatizer (see remarks in the Introduction), followed by curated stopword lists. Additionally, the subsequent text preprocessing can be improved if POS-tags are available.
Loading is done with the `Collection` class, which is the appropriate input format for the search algorithms implemented in RETRIEVE.
A `Collection` is built around individual `Doc` instances. A `Doc` is just a data structure that holds the input text, as well as a document id and some textual metadata if available. A `Collection` can be loaded using the `Collection.from_file` and `Collection.from_csv` methods, or it can be instantiated manually by creating individual `Doc` instances and passing them to the `Collection` constructor.
```python
from retrieve.data import Doc, Collection
line1 = ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
line2 = ['The', 'dog', 'jumped', 'on', 'the', 'mat', '.']
coll1 = Collection([Doc({'token': line1}, 'cat-doc')])
coll2 = Collection([Doc({'token': line2}, 'dog-doc')])
```
`Collection.from_file` assumes that the input files contain one sentence per line (although it can also perform shingling on the input text).
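Shingling means splitting a running text into overlapping fixed-size windows, so that reuse spanning sentence boundaries can still be matched. A minimal sketch of the idea in plain Python (the `size` and `stride` parameter names here are illustrative, not RETRIEVE's API):

```python
def shingle(tokens, size=4, stride=2):
    """Split a token stream into overlapping windows ('shingles')."""
    shingles = []
    for start in range(0, max(len(tokens) - size + 1, 1), stride):
        shingles.append(tokens[start:start + size])
    return shingles

words = ['in', 'principio', 'creavit', 'deus', 'caelum', 'et', 'terram']
print(shingle(words, size=4, stride=2))
# [['in', 'principio', 'creavit', 'deus'], ['creavit', 'deus', 'caelum', 'et']]
```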
`Collection.from_csv` uses one or more csv files as input. Typically, such a file will have `token`, `lemma` and `pos` fields.
```
$ head input.csv
token lemma pos
The the DET
cat cat N
sat sit V
```
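A file in that shape can also be inspected with the standard library's `csv` module. A minimal sketch, assuming tab-separated columns (adjust `delimiter` to match your files); the inline string stands in for a real file:

```python
import csv
import io

# Stand-in for a real file; in practice you would pass open('input.csv').
raw = "token\tlemma\tpos\nThe\tthe\tDET\ncat\tcat\tN\nsat\tsit\tV\n"

# DictReader maps each row to the column names from the header line.
rows = list(csv.DictReader(io.StringIO(raw), delimiter='\t'))
lemmas = [row['lemma'] for row in rows]
print(lemmas)  # ['the', 'cat', 'sit']
```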
## Preprocessing
Preprocessing is done with the `TextPreprocessor` class. In order to lowercase the input and filter out punctuation and stopwords, we can use the following snippet. For this example, we use one of the built-in datasets that come prepackaged with RETRIEVE.
```python
>>> from retrieve.data import TextPreprocessor
>>> from retrieve.utils import Stopwords
>>> from retrieve.corpora import load_vulgate
>>> coll = load_vulgate()
>>> TextPreprocessor(
...     stopwords=Stopwords('latin.stop'), lower=True, drop_punctuation=True
... ).process_collections(coll)
>>> coll[0].get_features()
['principium', 'creo', 'deus', 'caelum', 'terra']
```
We can also compute n-grams using the `min_n` and `max_n` arguments.
```python
>>> coll = load_vulgate()
>>> TextPreprocessor(
...     stopwords=Stopwords('latin.stop'), lower=True, drop_punctuation=True
... ).process_collections(coll, min_n=1, max_n=3)
>>> coll[0].get_features()
['principium',
'creo',
'deus',
'caelum',
'terra',
'principium--creo',
'creo--deus',
'deus--caelum',
'caelum--terra',
'principium--creo--deus',
'creo--deus--caelum',
'deus--caelum--terra']
```
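The grouping of the output above (all unigrams, then bigrams, then trigrams, joined with `--`) can be reproduced with a few lines of plain Python; this is just an illustration of the feature layout, not RETRIEVE's implementation:

```python
def ngram_features(tokens, min_n=1, max_n=3, sep='--'):
    """Return all n-grams from min_n to max_n, grouped by n."""
    feats = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(sep.join(tokens[i:i + n]))
    return feats

print(ngram_features(['principium', 'creo', 'deus'], min_n=1, max_n=2))
# ['principium', 'creo', 'deus', 'principium--creo', 'creo--deus']
```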
**Feature selection** can be done using the `retrieve.data.FeatureSelector` class, in combination with the `retrieve.data.Criterion` class. We can do feature selection based on:
- document frequency (using `Criterion.DF`)
- raw frequency (using `Criterion.FREQ`)
- inverse document frequency (using `Criterion.IDF`)
For example, in order to filter out features that occur in only a single document, we use the following code.
```python
>>> from retrieve.data import FeatureSelector, Criterion
>>> vocab = FeatureSelector(coll).filter_collections(coll, criterion=(Criterion.DF >= 2))
```
`FeatureSelector.filter_collections` returns the vocabulary of features after filtering.
`Criterion` objects can be combined using ordinary operators. For example, `(Criterion.DF >= 2) & (Criterion.FREQ >= 5)` drops hapaxes and features with fewer than 5 occurrences overall.
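The document-frequency criterion above is easy to picture in plain Python: count each feature once per document, then keep features whose count meets the threshold. This is a sketch of the idea behind `Criterion.DF >= 2`, not RETRIEVE's implementation:

```python
from collections import Counter

docs = [['a', 'b', 'a'], ['b', 'c'], ['b', 'd']]

df = Counter()
for doc in docs:
    df.update(set(doc))  # set(): each doc counts a feature at most once

# Keep only features appearing in at least 2 documents (DF >= 2).
vocab = {feat for feat, n in df.items() if n >= 2}
print(sorted(vocab))  # ['b']
```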
## Search
RETRIEVE implements three algorithm families.
- Set-based (Inverted-list approaches to efficient set-similarity measures)
- VSM-based (Vector Space Models including an optimized implementation of the soft-cosine measure)
- Local text alignment (Smith-Waterman)
### Set-based
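In the set-based family, each sentence is reduced to a set of features and candidate pairs are scored with a set-similarity measure such as the Jaccard coefficient. The sketch below shows the measure only; RETRIEVE's inverted-list pruning, which avoids scoring all pairs, is not shown:

```python
def jaccard(a, b):
    """Jaccard coefficient: shared features over all features."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

s1 = ['creo', 'deus', 'caelum', 'terra']
s2 = ['deus', 'caelum', 'mare']
print(jaccard(s1, s2))  # 2 shared / 5 total = 0.4
```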
### VSM
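In the VSM family, sentences become bag-of-words count vectors compared with the cosine measure. A minimal sketch of plain cosine; RETRIEVE's soft-cosine variant additionally weights term pairs by their embedding similarity, which is not shown here:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb)

print(cosine(['deus', 'caelum', 'terra'], ['deus', 'caelum', 'mare']))  # ≈ 0.667
```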
### Text-Alignment
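Smith-Waterman finds the best *local* alignment between two token sequences with a dynamic-programming score matrix, clipping negative scores to zero so unrelated stretches reset the match. A minimal token-level sketch (the scoring values `match=2, mismatch=-1, gap=-1` are illustrative defaults, not RETRIEVE's):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between token lists a and b."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # reset: start a new local match
                          H[i - 1][j - 1] + sub,  # substitution / match
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            best = max(best, H[i][j])
    return best

x = ['the', 'cat', 'sat', 'on', 'the', 'mat']
y = ['a', 'cat', 'sat', 'on', 'a', 'rug']
print(smith_waterman(x, y))  # 'cat sat on' aligns, for a score of 6
```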
# Quickrun
For convenience, all functionality has been packed into a single `pipeline` function.
# Visualization