RETRIEVE: A Text Reuse Software Package
--------
RETRIEVE is designed to make three text reuse retrieval algorithmic paradigms accessible to a broader audience.
The focus of RETRIEVE is small and medium scale retrieval problems (collections on the order of **tens of thousands** to **hundreds of thousands** of sentences). However, the software tries to optimize memory usage when possible, and with some manual tuning (and waiting time) it can be run on larger problems.
As of today, RETRIEVE implements the retrieval of text reuse at the word level. While it would be easy to extend it to operate on subword strings, this has been left out of the current implementation (PRs welcome).
Typically, text reuse retrieval software improves results by using lemmatized input. RETRIEVE tries to do one specialized thing well and therefore does not include lemmatization in the pipeline. This means that you will have to provide lemmatized input if you want to benefit from lemmatization. If you don't have a lemmatizer available for your language of choice, I recommend [PIE](https://github.com/emanjavacas/pie/), which you can use to train your own lemmatizer.
# Installation
## Installing from PyPI
RETRIEVE can be installed from PyPI using `pip`. Just fire up a command prompt and type:
```
pip install text-reuse-retrieve
```
## Installation from source
RETRIEVE can also be installed by first downloading the repository, installing the dependencies and issuing the `python setup.py install` command within the top directory.
Dependencies are kept in the `requirements.txt` file. To install them, use:
```
pip install -r requirements.txt
```
# Workflow
The workflow consists of the following steps:
- Data Preparation: gathering sentences on which to carry out the text reuse search
- Text Preprocessing: processing input documents so as to facilitate the subsequent search
- Search: running search algorithms on the input collections
## Data preparation
RETRIEVE doesn't offer many tools to aid the data preparation process, just functionality to load the resources and operate on them. The most important resource is a lemmatizer (see remarks in the Introduction), followed by curated stopword lists. Additionally, the subsequent text preprocessing can be improved if POS-tags are available.
Loading is done with the `Collection` class, which is the appropriate input format for the search algorithms implemented in RETRIEVE.
A `Collection` is built around individual `Doc` instances. A `Doc` is just a data structure that holds the input text, as well as a document id and some textual metadata if available. A `Collection` can be loaded using the `Collection.from_file` and `Collection.from_csv` methods, or it can be instantiated manually by creating individual `Doc` instances and passing them to the `Collection` constructor.
```python
from retrieve.data import Doc, Collection
line1 = ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
line2 = ['The', 'dog', 'jumped', 'on', 'the', 'mat', '.']
coll1 = Collection([Doc({'token': line1}, 'cat-doc')])
coll2 = Collection([Doc({'token': line2}, 'dog-doc')])
```
`Collection.from_file` assumes that the input files contain one sentence per line (although it can also perform shingling on the input text).
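Shingling means splitting a running text into overlapping fixed-size windows, so that reuse spanning sentence boundaries can still be matched. A minimal sketch of the idea in plain Python (the `size` and `stride` parameter names here are illustrative, not RETRIEVE's API):

```python
def shingle(tokens, size=4, stride=2):
    """Split a token stream into overlapping windows ('shingles')."""
    shingles = []
    for start in range(0, max(len(tokens) - size + 1, 1), stride):
        shingles.append(tokens[start:start + size])
    return shingles

words = ['in', 'principio', 'creavit', 'deus', 'caelum', 'et', 'terram']
print(shingle(words, size=4, stride=2))
# [['in', 'principio', 'creavit', 'deus'], ['creavit', 'deus', 'caelum', 'et']]
```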
`Collection.from_csv` uses one or more csv files as input. Typically, such a file will have `token`, `lemma` and `pos` fields.
```
$ head input.csv
token lemma pos
The the DET
cat cat N
sat sit V
```
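A file in that shape can also be inspected with the standard library's `csv` module. A minimal sketch, assuming tab-separated columns (adjust `delimiter` to match your files); the inline string stands in for a real file:

```python
import csv
import io

# Stand-in for a real file; in practice you would pass open('input.csv').
raw = "token\tlemma\tpos\nThe\tthe\tDET\ncat\tcat\tN\nsat\tsit\tV\n"

# DictReader maps each row to the column names from the header line.
rows = list(csv.DictReader(io.StringIO(raw), delimiter='\t'))
lemmas = [row['lemma'] for row in rows]
print(lemmas)  # ['the', 'cat', 'sit']
```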
## Preprocessing
Preprocessing is done with the `TextPreprocessor` class. In order to lowercase the input and filter out punctuation and stopwords, we can use the following snippet. For this example, we use one of the built-in datasets that come prepackaged with RETRIEVE.
```python
>>> from retrieve.data import TextPreprocessor
>>> from retrieve.utils import Stopwords
>>> from retrieve.corpora import load_vulgate
>>> coll = load_vulgate()
>>> TextPreprocessor(
...     stopwords=Stopwords('latin.stop'), lower=True, drop_punctuation=True
... ).process_collections(coll)
>>> coll[0].get_features()
['principium', 'creo', 'deus', 'caelum', 'terra']
```
We can also compute n-grams using the `min_n` and `max_n` arguments.
```python
>>> coll = load_vulgate()
>>> TextPreprocessor(
...     stopwords=Stopwords('latin.stop'), lower=True, drop_punctuation=True
... ).process_collections(coll, min_n=1, max_n=3)
>>> coll[0].get_features()
['principium',
'creo',
'deus',
'caelum',
'terra',
'principium--creo',
'creo--deus',
'deus--caelum',
'caelum--terra',
'principium--creo--deus',
'creo--deus--caelum',
'deus--caelum--terra']
```
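The grouping of the output above (all unigrams, then bigrams, then trigrams, joined with `--`) can be reproduced with a few lines of plain Python; this is just an illustration of the feature layout, not RETRIEVE's implementation:

```python
def ngram_features(tokens, min_n=1, max_n=3, sep='--'):
    """Return all n-grams from min_n to max_n, grouped by n."""
    feats = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(sep.join(tokens[i:i + n]))
    return feats

print(ngram_features(['principium', 'creo', 'deus'], min_n=1, max_n=2))
# ['principium', 'creo', 'deus', 'principium--creo', 'creo--deus']
```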
**Feature selection** can be done using the `retrieve.data.FeatureSelector` class, in combination with the `retrieve.data.Criterion` class. We can do feature selection based on:
- document frequency (using `Criterion.DF`)
- raw frequency (using `Criterion.FREQ`)
- inverse document frequency (using `Criterion.IDF`)
For example, in order to filter out features that occur in only a single document, we use the following code.
```python
>>> from retrieve.data import FeatureSelector, Criterion
>>> vocab = FeatureSelector(coll).filter_collections(coll, criterion=(Criterion.DF >= 2))
```
`FeatureSelector.filter_collections` returns the vocabulary of features after filtering.
`Criterion` objects can be combined using ordinary operators. For example, `(Criterion.DF >= 2) & (Criterion.FREQ >= 5)` drops hapaxes and features with fewer than 5 occurrences overall.
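The document-frequency criterion above is easy to picture in plain Python: count each feature once per document, then keep features whose count meets the threshold. This is a sketch of the idea behind `Criterion.DF >= 2`, not RETRIEVE's implementation:

```python
from collections import Counter

docs = [['a', 'b', 'a'], ['b', 'c'], ['b', 'd']]

df = Counter()
for doc in docs:
    df.update(set(doc))  # set(): each doc counts a feature at most once

# Keep only features appearing in at least 2 documents (DF >= 2).
vocab = {feat for feat, n in df.items() if n >= 2}
print(sorted(vocab))  # ['b']
```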
## Search
RETRIEVE implements three algorithm families.
- Set-based (Inverted-list approaches to efficient set-similarity measures)
- VSM-based (Vector Space Models including an optimized implementation of the soft-cosine measure)
- Local text alignment (Smith-Waterman)
### Set-based
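In the set-based family, each sentence is reduced to a set of features and candidate pairs are scored with a set-similarity measure such as the Jaccard coefficient. The sketch below shows the measure only; RETRIEVE's inverted-list pruning, which avoids scoring all pairs, is not shown:

```python
def jaccard(a, b):
    """Jaccard coefficient: shared features over all features."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

s1 = ['creo', 'deus', 'caelum', 'terra']
s2 = ['deus', 'caelum', 'mare']
print(jaccard(s1, s2))  # 2 shared / 5 total = 0.4
```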
### VSM
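In the VSM family, sentences become bag-of-words count vectors compared with the cosine measure. A minimal sketch of plain cosine; RETRIEVE's soft-cosine variant additionally weights term pairs by their embedding similarity, which is not shown here:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb)

print(cosine(['deus', 'caelum', 'terra'], ['deus', 'caelum', 'mare']))  # ≈ 0.667
```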
### Text-Alignment
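Smith-Waterman finds the best *local* alignment between two token sequences with a dynamic-programming score matrix, clipping negative scores to zero so unrelated stretches reset the match. A minimal token-level sketch (the scoring values `match=2, mismatch=-1, gap=-1` are illustrative defaults, not RETRIEVE's):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between token lists a and b."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # reset: start a new local match
                          H[i - 1][j - 1] + sub,  # substitution / match
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            best = max(best, H[i][j])
    return best

x = ['the', 'cat', 'sat', 'on', 'the', 'mat']
y = ['a', 'cat', 'sat', 'on', 'a', 'rug']
print(smith_waterman(x, y))  # 'cat sat on' aligns, for a score of 6
```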
# Quickrun
For convenience, all functionality has been packed into a single `pipeline` function.
# Visualization