[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
# The Text Alignment Tool
This Python text alignment tool is intended to be a general-purpose tool for aligning texts in a robust and easily extensible way. It tracks all changes to the original text so that there is an end-to-end mapping of the alignment data.
## Architecture
![Diagram of Alignment Tool Pipeline Structure](./aligner_pipeline.svg "Alignment Tool Pipeline Structure")
1. The alignment tool consists of a main class `TextAlignmentTool`, which coordinates the alignment process.
2. The alignment tool receives a single `TextLoader` for the query text and a single `TextLoader` for the target text (each `TextLoader` must keep track of the mapping from the original input text(s) to the output it passes on to the rest of the pipeline).
3. The alignment tool is then fed *n* `TextTransformer`s for each text and *n* `AlignmentAlgorithm`s. These can be used in any combination and order; for example, the query text could pass through three `TextTransformer`s and the target text through one, both texts could then go through a single `AlignmentAlgorithm`, the target text could pass through two more `TextTransformer`s, and a final `AlignmentAlgorithm` could be run on the pair of texts.
4. `find_alignment_to_query` and `find_alignment_to_target` will backtrack through the text mappings and provide a key for mapping either the query to the target or the target to the query.
A somewhat basic alignment process could look something like this:
```python
# Create text loaders for query and target
query_loader = PgpAltoXMLTextLoader(list(QUERY_TEMP_FOLDER.glob("**/*.xml")))
target_loader = PgpXmlTeiTextLoader(list(TARGET_TEMP_FOLDER.glob("**/*.xml")))
# Create the alignment tool
aligner = TextAlignmentTool(query_loader, target_loader)
# Perform three transformation operations on the target
normalize_target_sigla = PgpTeiNormalizeSiglaTransformer()
remove_target_extras = PgpTeiRemoveExtrasTransformer()
relocate_insertions = PgpTeiRelocateInsertionsTransformer()
aligner.target_text_transforms(
    [normalize_target_sigla, remove_target_extras, relocate_insertions]
)
# Create and run one alignment process
first_alignment_algorithm = LineAlignmentAlgorithm()
aligner.align_text(first_alignment_algorithm)
# Get the mapping information for the alignment
alignment_mappings = aligner.latest_alignment
```
## Functionality
Tracking of text changes and mappings to aligned text uses a system of index maps. The `TextLoader` ingests the input text and outputs a one-dimensional numpy uint32 array containing one number for each character in the input text, in order of occurrence (the number is simply the Unicode code point of the character, as returned by Python's [`ord`](https://docs.python.org/3/library/functions.html#ord) function).
### Text Loader
For example, let's imagine we have our initial text in a simple text file, and we will assume the line breaks are significant for the alignment process:
```
Lorem ipsum dolor sit amet
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
```
We would write a simple loader for this to ingest the text and preserve a record of the line breaks:
```python
from text_alignment_tool import TextChunk
import numpy as np
text = """Lorem ipsum dolor sit amet
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"""
def parse_text(text: str) -> tuple[list[tuple[int, int]], list[TextChunk], np.ndarray]:
    input_output_map: list[tuple[int, int]] = []
    text_chunk_indices: list[TextChunk] = []
    output_text: list[int] = []
    text_chunk_start_idx = 0
    for input_idx, char in enumerate(text):
        output_idx = len(output_text)
        input_output_map.append((input_idx, output_idx))
        if char == "\n":
            # The newline itself is dropped from the output, so the next
            # character reuses this output index as the start of the next chunk.
            text_chunk_indices.append(TextChunk(text_chunk_start_idx, output_idx))
            text_chunk_start_idx = output_idx
            continue
        output_text.append(ord(char))
    # The text does not end with a newline, so flush the final chunk.
    text_chunk_indices.append(TextChunk(text_chunk_start_idx, len(output_text)))
    return input_output_map, text_chunk_indices, np.array(output_text, dtype=np.uint32)

input_output_map, text_chunk_indices, output_text = parse_text(text)
# Inspect the results
print(input_output_map[25:30])
print(text_chunk_indices)
print(output_text[0:5])
# Deserialize text
print(''.join([chr(x) for x in output_text[0:5]]))
```
Output:
```
[(25, 25), (26, 26), (27, 26), (28, 27), (29, 28)]
[TextChunk(start_idx=0, end_idx=26), TextChunk(start_idx=26, end_idx=53), TextChunk(start_idx=53, end_idx=118)]
[ 76 111 114 101 109]
Lorem
```
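As a follow-on, a chunk's (start, end) indices are plain positions in the output array, so decoding a chunk back to a string is a slice plus `chr()`. The following self-contained sketch re-derives the array inline from the first two example lines:

```python
# Decoding a chunk back to text: chunk indices are positions in the output
# array, so a slice plus chr() recovers the original characters.
text = "Lorem ipsum dolor sit amet\nconsectetur adipiscing elit"
output_text = [ord(c) for c in text if c != "\n"]  # the loader drops the newline
chunks = [(0, 26), (26, 53)]  # one chunk per original line
second_line = "".join(chr(x) for x in output_text[chunks[1][0]:chunks[1][1]])
print(second_line)  # consectetur adipiscing elit
```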
When creating a custom text loader, subclass `TextLoader` and make sure to populate `self._output`, `self._input_output_map`, and `self._text_chunk_indices`. You can modify the `__init__()` method to take whatever variables you need, and you may adapt the class however necessary to perform the parsing operation. A nice addition is a method on the custom `TextLoader` that rebuilds text in the input format from the alignment data.
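A minimal sketch of such a subclass, following the same pattern as the `parse_text` example above. The `CommaSeparatedTextLoader` name is hypothetical, a stub base class stands in for the package's `TextLoader` so the sketch is self-contained, and plain `(start, end)` tuples stand in for `TextChunk`:

```python
class TextLoader:  # stand-in for text_alignment_tool.TextLoader
    def __init__(self):
        self._output: list[int] = []
        self._input_output_map: list[tuple[int, int]] = []
        self._text_chunk_indices: list[tuple[int, int]] = []

class CommaSeparatedTextLoader(TextLoader):
    """Hypothetical loader: each comma-separated field becomes one chunk."""
    def __init__(self, text: str):
        super().__init__()
        chunk_start = 0
        for input_idx, char in enumerate(text):
            output_idx = len(self._output)
            self._input_output_map.append((input_idx, output_idx))
            if char == ",":
                # commas are dropped from the output and close a chunk
                self._text_chunk_indices.append((chunk_start, output_idx))
                chunk_start = output_idx
                continue
            self._output.append(ord(char))
        self._text_chunk_indices.append((chunk_start, len(self._output)))

loader = CommaSeparatedTextLoader("alpha,beta")
print(loader._text_chunk_indices)  # [(0, 5), (5, 9)]
```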
### Text Transformer
The output of the `TextLoader` may be exactly what is needed for the alignment process, but often it will be necessary to perform other alterations such as stripping out unnecessary characters, performing some rule-based character conversions, or refining the text chunks. Any number of `TextTransformer`s can be used in series to accomplish this. Using narrowly focused `TextTransformer`s makes it easier to debug and to mix and match `TextTransformer`s as needed to achieve the desired alignment.
When passing a text through a `TextTransformer`, the transformer must use its `_input_output_map` to track how it has changed the input. For instance, if we wanted to create a transformer to remove the word "the", we might start with a text input like "the quick brown dog jumped over the lazy fox.", which in the alignment tool is:
```[116, 104, 101, 32, 113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]```
The output from the `TextTransformer` would be:
```[113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]```
And the `_input_output_map` would show the mappings from the index of the input array to the index of the output array:
```[(4,0),(5,1),(6,2),(7,3), ...]```
| input val | map input idx to output idx | output val |
| :-------: | :-------------------------: | :--------: |
| 113 | (4,0) | 113 |
| 117 | (5,1) | 117 |
| 105 | (6,2) | 105 |
| 99 | (7,3) | 99 |
| 107 | (8,4) | 107 |
| 32 | (9,5) | 32 |
| ... | ... | ... |
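The remove-"the" transform above can be sketched in a few lines (the `remove_word` helper is hypothetical, not part of the package; it records an `(input_idx, output_idx)` pair for every code point it keeps):

```python
# Sketch: drop every occurrence of a word (plus its trailing space) from a
# code-point array, recording an input->output index pair for kept characters.
def remove_word(codes: list[int], word: str) -> tuple[list[int], list[tuple[int, int]]]:
    drop = [ord(c) for c in word]
    output: list[int] = []
    mapping: list[tuple[int, int]] = []
    i = 0
    while i < len(codes):
        if codes[i:i + len(drop)] == drop:
            i += len(drop)  # skip the dropped word: no mapping entries
            continue
        mapping.append((i, len(output)))
        output.append(codes[i])
        i += 1
    return output, mapping

codes = [ord(c) for c in "the quick brown dog jumped over the lazy fox."]
output, mapping = remove_word(codes, "the ")
print(mapping[:4])  # [(4, 0), (5, 1), (6, 2), (7, 3)]
```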
Changing the order of individual elements in the list is also possible; for instance, for the same input above we could instead have the output:
```[98, 114, 111, 119, 110, 32, 113, 117, 105, 99, 107, 32, 116, 104, 101, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]```
The words "the" and "brown" have been transposed, and the resulting `_input_output_map` would be:
```[(10,0),(11,1),(12,2),(13,3),(14,4),(3,5),(4,6),(5,7),(6,8),(7,9),(8,10),(9,11),(0,12),(1,13),(2,14),(15,15), ...]```
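A map like this can be checked mechanically: applying each `(input_idx, output_idx)` pair to the input array must reproduce the corresponding output value. Here the transposed output string is reconstructed from the code points listed above:

```python
# Verify the transposition map: every mapped pair must carry the same
# code point on both the input and output side.
inp = [ord(c) for c in "the quick brown dog jumped over the lazy fox."]
out = [ord(c) for c in "brown quick the dog jumped over the lazy fox."]
mapping = [(10, 0), (11, 1), (12, 2), (13, 3), (14, 4), (3, 5), (4, 6), (5, 7),
           (6, 8), (7, 9), (8, 10), (9, 11), (0, 12), (1, 13), (2, 14), (15, 15)]
assert all(inp[i] == out[o] for i, o in mapping)
```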
The `TextTransformer` may also redefine text chunks via the `_text_chunk_indices` property, a simple ordered list of (start, end) index pairs that define *n* sections of the output text (overlapping sections are allowed if desired), e.g., `[(0,20),(21,35),(30,91)]`, which defines three chunks of the text.
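Since chunk boundaries are ordinary index pairs into the output array, extracting a chunk is a plain slice. A small self-contained sketch, with the last two chunks overlapping:

```python
# Text chunks are (start, end) index pairs into the output array; extracting
# one is an ordinary slice, and chunks are allowed to overlap.
text = "the quick brown dog jumped over the lazy fox."
codes = [ord(c) for c in text]
chunks = [(0, 15), (16, 26), (20, 45)]  # the last two chunks overlap
pieces = ["".join(chr(c) for c in codes[s:e]) for s, e in chunks]
print(pieces[0])  # the quick brown
```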