[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
# The Text Alignment Tool
This Python text alignment tool is intended to be a general-purpose tool for aligning texts in a robust and easily extensible way. It tracks all changes to the original text so that there is an end-to-end mapping of the alignment data.
## Architecture
![Diagram of Alignment Tool Pipeline Structure](./aligner_pipeline.svg "Alignment Tool Pipeline Structure")
1. The alignment tool consists of a main class `TextAlignmentTool`, which coordinates the alignment process.
2. The alignment tool receives a single `TextLoader` for the query text and a single `TextLoader` for the target text (each `TextLoader` must keep track of the mapping from the original input text(s) to the output it passes on to the rest of the pipeline).
3. The alignment tool is then fed *n* `TextTransformer`s for each text and *n* `AlignmentAlgorithm`s. These can be used in any combination and order; for example, the query text could pass through three `TextTransformer`s and the target text through one, both texts could then go through a single `AlignmentAlgorithm`, the target text could pass through two more `TextTransformer`s, and a final `AlignmentAlgorithm` could be run on the pair of texts.
4. `find_alignment_to_query` and `find_alignment_to_target` will backtrack through the text mappings and provide a key for mapping either the query to the target or the target to the query.
A somewhat basic alignment process could look something like this:
```python
# Create text loaders for query and target
query_loader = PgpAltoXMLTextLoader(list(QUERY_TEMP_FOLDER.glob("**/*.xml")))
target_loader = PgpXmlTeiTextLoader(list(TARGET_TEMP_FOLDER.glob("**/*.xml")))
# Create the alignment tool
aligner = TextAlignmentTool(query_loader, target_loader)
# Perform three transformation operations on the target
normalize_target_sigla = PgpTeiNormalizeSiglaTransformer()
remove_target_extras = PgpTeiRemoveExtrasTransformer()
relocate_insertions = PgpTeiRelocateInsertionsTransformer()
aligner.target_text_transforms(
    [normalize_target_sigla, remove_target_extras, relocate_insertions]
)
# Create and run one alignment process
first_alignment_algorithm = LineAlignmentAlgorithm()
aligner.align_text(first_alignment_algorithm)
# Get the mapping information for the alignment
alignment_mappings = aligner.latest_alignment
```
## Functionality
Tracking of text changes and mappings to aligned text uses a system of index maps. The `TextLoader` ingests the input text and outputs a one-dimensional numpy uint32 array containing one number for each character in the input text, in order of occurrence (the number is simply the Unicode code point of the character, as returned by Python's [`ord`](https://docs.python.org/3/library/functions.html#ord) function).
### Text Loader
For example, let's imagine we have our initial text in a simple text file, and we will assume the line breaks are significant for the alignment process:
```
Lorem ipsum dolor sit amet
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
```
We would write a simple loader for this to ingest the text and preserve a record of the line breaks:
```python
from text_alignment_tool import TextChunk
import numpy as np
text = """Lorem ipsum dolor sit amet
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"""
def parse_text(text: str) -> tuple[list[tuple[int, int]], list[TextChunk], np.ndarray]:
    input_output_map: list[tuple[int, int]] = []
    text_chunk_indices: list[TextChunk] = []
    output_text: list[int] = []
    text_chunk_start_idx = 0
    for input_idx, char in enumerate(text):
        output_idx = len(output_text)
        input_output_map.append((input_idx, output_idx))
        if char == "\n":
            # The newline itself is dropped from the output, so the next
            # character reuses this output index as the start of the next chunk.
            text_chunk_indices.append(TextChunk(text_chunk_start_idx, output_idx))
            text_chunk_start_idx = output_idx
            continue
        output_text.append(ord(char))
    # The text does not end with a newline, so flush the final chunk.
    text_chunk_indices.append(TextChunk(text_chunk_start_idx, len(output_text)))
    return input_output_map, text_chunk_indices, np.array(output_text, dtype=np.uint32)

input_output_map, text_chunk_indices, output_text = parse_text(text)
# Inspect the results
print(input_output_map[25:30])
print(text_chunk_indices)
print(output_text[0:5])
# Deserialize text
print(''.join([chr(x) for x in output_text[0:5]]))
```
Output:
```
[(25, 25), (26, 26), (27, 26), (28, 27), (29, 28)]
[TextChunk(start_idx=0, end_idx=26), TextChunk(start_idx=26, end_idx=53), TextChunk(start_idx=53, end_idx=118)]
[ 76 111 114 101 109]
Lorem
```
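As a follow-on, a chunk's (start, end) indices are plain positions in the output array, so decoding a chunk back to a string is a slice plus `chr()`. The following self-contained sketch re-derives the array inline from the first two example lines:

```python
# Decoding a chunk back to text: chunk indices are positions in the output
# array, so a slice plus chr() recovers the original characters.
text = "Lorem ipsum dolor sit amet\nconsectetur adipiscing elit"
output_text = [ord(c) for c in text if c != "\n"]  # the loader drops the newline
chunks = [(0, 26), (26, 53)]  # one chunk per original line
second_line = "".join(chr(x) for x in output_text[chunks[1][0]:chunks[1][1]])
print(second_line)  # consectetur adipiscing elit
```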
When creating a custom text loader, subclass `TextLoader` and make sure to populate `self._output`, `self._input_output_map`, and `self._text_chunk_indices`. You can modify the `__init__()` method to take whatever variables you need, and you may adapt the class however necessary to perform the parsing operation. A nice addition is a method on the custom `TextLoader` that rebuilds text in the input format from the alignment data.
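A minimal sketch of such a subclass, following the same pattern as the `parse_text` example above. The `CommaSeparatedTextLoader` name is hypothetical, a stub base class stands in for the package's `TextLoader` so the sketch is self-contained, and plain `(start, end)` tuples stand in for `TextChunk`:

```python
class TextLoader:  # stand-in for text_alignment_tool.TextLoader
    def __init__(self):
        self._output: list[int] = []
        self._input_output_map: list[tuple[int, int]] = []
        self._text_chunk_indices: list[tuple[int, int]] = []

class CommaSeparatedTextLoader(TextLoader):
    """Hypothetical loader: each comma-separated field becomes one chunk."""
    def __init__(self, text: str):
        super().__init__()
        chunk_start = 0
        for input_idx, char in enumerate(text):
            output_idx = len(self._output)
            self._input_output_map.append((input_idx, output_idx))
            if char == ",":
                # commas are dropped from the output and close a chunk
                self._text_chunk_indices.append((chunk_start, output_idx))
                chunk_start = output_idx
                continue
            self._output.append(ord(char))
        self._text_chunk_indices.append((chunk_start, len(self._output)))

loader = CommaSeparatedTextLoader("alpha,beta")
print(loader._text_chunk_indices)  # [(0, 5), (5, 9)]
```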
### Text Transformer
The output of the `TextLoader` may be exactly what is needed for the alignment process, but often it will be necessary to perform other alterations such as stripping out unnecessary characters, performing some rule-based character conversions, or refining the text chunks. Any number of `TextTransformer`s can be used in series to accomplish this. Using narrowly focused `TextTransformer`s makes it easier to debug and to mix and match `TextTransformer`s as needed to achieve the desired alignment.
When passing a text through a `TextTransformer`, the transformer must use its `_input_output_map` to track how it has changed the input. For instance, if we wanted to create a transformer to remove the word "the", we might start with a text input like "the quick brown dog jumped over the lazy fox.", which in the alignment tool is:
```[116, 104, 101, 32, 113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]```
The output from the `TextTransformer` would be:
```[113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]```
And the `_input_output_map` would show the mappings from the index of the input array to the index of the output array:
```[(4,0),(5,1),(6,2),(7,3), ...]```
| input val | map input idx to output idx | output val |
| :-------: | :-------------------------: | :--------: |
| 113 | (4,0) | 113 |
| 117 | (5,1) | 117 |
| 105 | (6,2) | 105 |
| 99 | (7,3) | 99 |
| 107 | (8,4) | 107 |
| 32 | (9,5) | 32 |
| ... | ... | ... |
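The remove-"the" transform above can be sketched in a few lines (the `remove_word` helper is hypothetical, not part of the package; it records an `(input_idx, output_idx)` pair for every code point it keeps):

```python
# Sketch: drop every occurrence of a word (plus its trailing space) from a
# code-point array, recording an input->output index pair for kept characters.
def remove_word(codes: list[int], word: str) -> tuple[list[int], list[tuple[int, int]]]:
    drop = [ord(c) for c in word]
    output: list[int] = []
    mapping: list[tuple[int, int]] = []
    i = 0
    while i < len(codes):
        if codes[i:i + len(drop)] == drop:
            i += len(drop)  # skip the dropped word: no mapping entries
            continue
        mapping.append((i, len(output)))
        output.append(codes[i])
        i += 1
    return output, mapping

codes = [ord(c) for c in "the quick brown dog jumped over the lazy fox."]
output, mapping = remove_word(codes, "the ")
print(mapping[:4])  # [(4, 0), (5, 1), (6, 2), (7, 3)]
```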
Changing the order of individual elements in the list is also possible; for instance, for the same input above we could instead have the output:
```[98, 114, 111, 119, 110, 32, 113, 117, 105, 99, 107, 32, 116, 104, 101, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]```
The words "the" and "brown" have been transposed, and the resulting `_input_output_map` would be:
```[(10,0),(11,1),(12,2),(13,3),(14,4),(3,5),(4,6),(5,7),(6,8),(7,9),(8,10),(9,11),(0,12),(1,13),(2,14),(15,15), ...]```
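A map like this can be checked mechanically: applying each `(input_idx, output_idx)` pair to the input array must reproduce the corresponding output value. Here the transposed output string is reconstructed from the code points listed above:

```python
# Verify the transposition map: every mapped pair must carry the same
# code point on both the input and output side.
inp = [ord(c) for c in "the quick brown dog jumped over the lazy fox."]
out = [ord(c) for c in "brown quick the dog jumped over the lazy fox."]
mapping = [(10, 0), (11, 1), (12, 2), (13, 3), (14, 4), (3, 5), (4, 6), (5, 7),
           (6, 8), (7, 9), (8, 10), (9, 11), (0, 12), (1, 13), (2, 14), (15, 15)]
assert all(inp[i] == out[o] for i, o in mapping)
```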
The `TextTransformer` may also redefine text chunks via the `_text_chunk_indices` property, a simple ordered list of (start, end) index pairs that define *n* sections of the output text (overlapping sections are allowed if desired), e.g., `[(0,20),(21,35),(30,91)]`, which defines three chunks of the text.
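Since chunk boundaries are ordinary index pairs into the output array, extracting a chunk is a plain slice. A small self-contained sketch, with the last two chunks overlapping:

```python
# Text chunks are (start, end) index pairs into the output array; extracting
# one is an ordinary slice, and chunks are allowed to overlap.
text = "the quick brown dog jumped over the lazy fox."
codes = [ord(c) for c in text]
chunks = [(0, 15), (16, 26), (20, 45)]  # the last two chunks overlap
pieces = ["".join(chr(c) for c in codes[s:e]) for s, e in chunks]
print(pieces[0])  # the quick brown
```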