# Genie NLP library
[![Build Status](https://travis-ci.com/stanford-oval/genienlp.svg?branch=master)](https://travis-ci.com/stanford-oval/genienlp) [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/stanford-oval/genienlp.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/stanford-oval/genienlp/context:python)
This library contains the NLP models for the [Genie](https://github.com/stanford-oval/genie-toolkit) toolkit for
virtual assistants. It is derived from the [decaNLP](https://github.com/salesforce/decaNLP) library by Salesforce,
but has diverged significantly.
The library is suitable for all NLP tasks that can be framed as Contextual Question Answering, that is, with 3 inputs:
- text or structured input as _context_
- text input as _question_
- text or structured output as _answer_
As the work by [McCann et al.](https://arxiv.org/abs/1806.08730) shows, many NLP tasks can be framed in this way.
Genie primarily uses the library for semantic parsing, paraphrasing, translation, and dialogue state tracking, and these are
the tasks the models work best for.
## Installation
genienlp is available on PyPI. You can install it with:
```bash
pip3 install genienlp
```
After installation, the `genienlp` command becomes available.
## Usage
### Training a semantic parser
The general form is:
```bash
genienlp train --tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> <flags>
```
The `<datadir>` should contain a single folder called "almond" (the name of the task). That folder should
contain the files "train.tsv" and "eval.tsv" for the train and dev sets, respectively.
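For reference, here is a minimal sketch of the expected layout (the exact TSV column format depends on the task and is not shown here):
```bash
# Create the directory structure genienlp expects for the "almond" task,
# then place the tab-separated train and dev files inside it.
mkdir -p <datadir>/almond
cp /path/to/your/train.tsv <datadir>/almond/train.tsv
cp /path/to/your/eval.tsv <datadir>/almond/eval.tsv
```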
To train a BERT-LSTM (or other MLM-based model) use:
```bash
genienlp train --tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
--model TransformerLSTM --pretrained_model bert-base-cased --trainable_decoder_embedding 50
```
To train a BART or other Seq2Seq model, use:
```bash
genienlp train --tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
--model TransformerSeq2Seq --pretrained_model facebook/bart-large --gradient_accumulation_steps 20
```
The default batch sizes are tuned for training on a single V100 GPU. Use `--train_batch_tokens` and `--val_batch_size`
to control the batch sizes. See `genienlp train --help` for the full list of options.
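For example, to train with smaller batches on a GPU with less memory (the values below are only illustrative, not tuned):
```bash
genienlp train --tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
--model TransformerSeq2Seq --pretrained_model facebook/bart-large \
--train_batch_tokens 2000 --val_batch_size 1000
```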
**NOTE**: the BERT-LSTM model used by the current version of the library is not comparable with the
one used in our published paper (cited below), because the input preprocessing is different. If you
wish to compare with published results you should use genienlp <= 0.5.0.
### Inference on a semantic parser
In batch mode:
```bash
genienlp predict --tasks almond --data <datadir> --path <model_dir> --eval_dir <output>
```
The `<datadir>` should contain a single folder called "almond" (the name of the task). That folder should
contain the files "train.tsv" and "eval.tsv" for the train and dev sets, respectively. The result of batch prediction
will be saved in `<output>/almond/valid.tsv`, as a TSV file containing ID and prediction.
In interactive mode:
```bash
genienlp server --path <model_dir>
```
This opens a TCP server that listens for requests formatted as JSON objects containing `id` (the ID of the request),
`task` (the name of the task), `context`, and `question`. The server replies with JSON objects containing `id` and
`answer`. It listens on port 8401 by default; use `--port` to specify a different port, or `--stdin` to
use standard input/output instead of TCP.
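For example, assuming the `--stdin` mode accepts one JSON request per line, a single request could be sent like this (the context and question are placeholders):
```bash
# Send one request over standard input; the server replies with a JSON object
# containing the same "id" and the predicted "answer".
echo '{"id": "req-0", "task": "almond", "context": "...", "question": "..."}' \
| genienlp server --path <model_dir> --stdin
```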
### Calibrating a trained model
Calibrate the confidence scores of a trained model:
1. Calculate and save the confidence features of the evaluation set in a pickle file:
```bash
genienlp predict --task almond --data <datadir> --path <model_dir> --save_confidence_features --confidence_feature_path <confidence_feature_file>
```
2. Train a boosted tree to map confidence features to a score between 0 and 1:
```bash
genienlp calibrate --confidence_path <confidence_feature_file> --save <calibrator_directory> --name_prefix <calibrator_name>
```
3. Now, if you provide `--calibrator_paths` during prediction, the model outputs a confidence score for each prediction:
```bash
genienlp predict --tasks almond --data <datadir> --path <model_dir> --calibrator_paths <calibrator_directory>/<calibrator_name>.calib
```
### Paraphrasing
Train a paraphrasing model:
```bash
genienlp train-paraphrase --train_data_file <train_data_file> --eval_data_file <dev_data_file> --output_dir <model_dir> --model_type gpt2 --do_train --do_eval --evaluate_during_training --logging_steps 1000 --save_steps 1000 --max_steps 40000 --save_total_limit 2 --gradient_accumulation_steps 16 --per_gpu_eval_batch_size 4 --per_gpu_train_batch_size 4 --num_train_epochs 1 --model_name_or_path <gpt2/gpt2-medium/gpt2-large/gpt2-xl>
```
Generate paraphrases:
```bash
genienlp run-paraphrase --model_name_or_path <model_dir> --temperature 0.3 --repetition_penalty 1.0 --num_samples 4 --batch_size 32 --input_file <input_tsv_file> --input_column 1
```
### Translation
Use the following command to train or finetune an NMT model:
```bash
genienlp train --train_tasks almond_translate --data <data_directory> --train_languages <src_lang> --eval_languages <tgt_lang> --no_commit --train_iterations <iterations> --preserve_case --save <save_dir> --exist_ok --skip_cache --model TransformerSeq2Seq --pretrained_model <hf_model_name>
```
We currently support MarianMT, MBART, MT5, and M2M100 models.<br>
To save a pretrained model in genienlp format without any finetuning, set `--train_iterations` to 0. You can then use this model for inference.
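For example, to wrap a pretrained MarianMT model without finetuning (the model name and language pair are only examples):
```bash
genienlp train --train_tasks almond_translate --data <data_directory> --train_languages en --eval_languages de \
--no_commit --train_iterations 0 --preserve_case --save <save_dir> --exist_ok --skip_cache \
--model TransformerSeq2Seq --pretrained_model Helsinki-NLP/opus-mt-en-de
```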
To produce translations for an eval or test set, run the following command:
```bash
genienlp predict --tasks almond_translate --data <data_directory> --pred_languages <src_lang> --pred_tgt_languages <tgt_lang> --path <path_to_saved_model> --eval_dir <eval_dir> --skip_cache --val_batch_size 4000 --evaluate <valid/test> --overwrite --silent
```
If your dataset is a document or contains long examples, pass `--translate_example_split` to break the examples down into individual sentences before translation for better results. <br>
To use alignment as described in our localization paper (cited below), pass `--replace_qp` and `--force_replace_qp`, which ensure that the parameters between quotation marks in the sentence are preserved in the output (see the sketch below).
The alignment code has been updated and improved since the 0.6.0 release, so if you wish to compare with those results, use genienlp <= 0.6.0. However, we recommend using the newer version for higher translation quality.
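As a sketch, the alignment flags are simply appended to the prediction command above:
```bash
genienlp predict --tasks almond_translate --data <data_directory> --pred_languages <src_lang> --pred_tgt_languages <tgt_lang> \
--path <path_to_saved_model> --eval_dir <eval_dir> --skip_cache --val_batch_size 4000 --evaluate <valid/test> \
--overwrite --silent --replace_qp --force_replace_qp
```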
### Named Entity Disambiguation
First, run a Bootleg model to extract mentions, entity candidates, and contextual embeddings for the mentions.
```bash
genienlp bootleg-dump-features --train_tasks <train_task_names> --save <savedir> --preserve_case --data <dataset_dir> --train_batch_tokens 1200 --val_batch_size 2000 --database_type json --database_dir <database_dir> --min_entity_len 1 --max_entity_len 4 --bootleg_model <bootleg_model>
```
This command generates several output files. In `<dataset_dir>` you should see a `prep` directory, which contains preprocessed data (e.g. data converted to memory-mapped format, several arrays that facilitate embedding lookup, etc.). If your dataset doesn't change, you can reuse these files.
It will also generate several files in the `<results_temp>` folder. In `eval_bootleg/[train|eval]/<bootleg_model>/bootleg_labels.jsonl` you can see the examples, mentions, predicted candidates, and their probabilities according to Bootleg.
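For a quick look at these predictions, you can pretty-print the first record (using the placeholder paths above):
```bash
head -n 1 <results_temp>/eval_bootleg/train/<bootleg_model>/bootleg_labels.jsonl | python3 -m json.tool
```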
Now you can use the features extracted by Bootleg in downstream tasks such as semantic parsing to improve named entity understanding and, consequently, generation:
```bash
genienlp train --train_tasks <train_task_names> --train_iterations <iterations> --preserve_case --save <savedir> --data <dataset_dir> --model TransformerSeq2Seq --pretrained_model facebook/bart-base --train_batch_tokens 1000 --val_batch_size 1000