# Genie NLP library
[![Build Status](https://travis-ci.com/stanford-oval/genienlp.svg?branch=master)](https://travis-ci.com/stanford-oval/genienlp) [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/stanford-oval/genienlp.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/stanford-oval/genienlp/context:python)
This library contains the NLP models for the [Genie](https://github.com/stanford-oval/genie-toolkit) toolkit for
virtual assistants. It is derived from the [decaNLP](https://github.com/salesforce/decaNLP) library by Salesforce,
but has diverged significantly.
The library is suitable for all NLP tasks that can be framed as Contextual Question Answering, that is, with three components:
- text or structured input as _context_
- text input as _question_
- text or structured output as _answer_
As the work by [McCann et al.](https://arxiv.org/abs/1806.08730) shows, many NLP tasks can be framed in this way.
Genie primarily uses the library for semantic parsing, paraphrasing, translation, and dialogue state tracking, and the
models work best on these tasks.
## Installation
genienlp is available on PyPI. You can install it with:
```bash
pip3 install genienlp
```
After installation, the `genienlp` command becomes available.
## Usage
### Training a semantic parser
The general form is:
```bash
genienlp train --tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> <flags>
```
The `<datadir>` should contain a single folder called "almond" (the name of the task). That folder should
contain the files "train.tsv" and "eval.tsv" for train and dev set respectively.
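As a quick sanity check before training, the layout described above can be verified with a few lines of Python. This is a minimal sketch, not part of genienlp: the helper name `missing_files` is hypothetical, and it only checks that the two files exist, not that their contents are valid.

```python
import tempfile
from pathlib import Path


def missing_files(datadir, task="almond", splits=("train.tsv", "eval.tsv")):
    """Return the expected dataset files that are absent under <datadir>/<task>/."""
    task_dir = Path(datadir) / task
    return [str(task_dir / name) for name in splits if not (task_dir / name).is_file()]


# Demo on a throwaway directory: first empty, then with the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    missing_before = missing_files(tmp)
    task_dir = Path(tmp) / "almond"
    task_dir.mkdir()
    for name in ("train.tsv", "eval.tsv"):
        (task_dir / name).touch()
    missing_after = missing_files(tmp)

print(len(missing_before), len(missing_after))
```

An empty list means both files are in place and `<datadir>` can be passed to `--data`.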
To train a BERT-LSTM (or another MLM-based model), use:
```bash
genienlp train --tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
--model TransformerLSTM --pretrained_model bert-base-cased --trainable_decoder_embedding 50
```
To train a BART or other Seq2Seq model, use:
```bash
genienlp train --tasks almond --train_iterations 50000 --data <datadir> --save <model_dir> \
--model TransformerSeq2Seq --pretrained_model facebook/bart-large --gradient_accumulation_steps 20
```
The default batch sizes are tuned for training on a single V100 GPU. Use `--train_batch_tokens` and `--val_batch_size`
to control the batch sizes. See `genienlp train --help` for the full list of options.
**NOTE**: the BERT-LSTM model used by the current version of the library is not comparable with the
one used in our published paper (cited below), because the input preprocessing is different. If you
wish to compare with published results you should use genienlp <= 0.5.0.
### Inference on a semantic parser
In batch mode:
```bash
genienlp predict --tasks almond --data <datadir> --path <model_dir> --eval_dir <output>
```
The `<datadir>` should contain a single folder called "almond" (the name of the task). That folder should
contain the files "train.tsv" and "eval.tsv" for train and dev set respectively. The result of batch prediction
will be saved in `<output>/almond/valid.tsv`, as a TSV file containing ID and prediction.
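The prediction file can then be loaded for further analysis. The sketch below is illustrative only: `read_predictions` is a hypothetical helper, and it assumes that the first tab-separated column is the ID, the second is the prediction, and the file has no header row. Check your own output file before relying on that layout.

```python
import csv
import tempfile
from pathlib import Path


def read_predictions(output_dir, task="almond"):
    """Map example ID to prediction, read from <output_dir>/<task>/valid.tsv."""
    path = Path(output_dir) / task / "valid.tsv"
    predictions = {}
    with open(path, newline="", encoding="utf-8") as f:
        # Assumption: column 0 is the ID, column 1 is the prediction.
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:
                predictions[row[0]] = row[1]
    return predictions


# Demo with a made-up two-line prediction file.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "almond"
    out_dir.mkdir()
    (out_dir / "valid.tsv").write_text(
        "ex1\tfirst prediction\nex2\tsecond prediction\n", encoding="utf-8"
    )
    demo_predictions = read_predictions(tmp)

print(demo_predictions["ex1"])
```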
In interactive mode:
```bash
genienlp server --path <model_dir>
```
This opens a TCP server that listens for requests, formatted as JSON objects containing `id` (the ID of the request),
`task` (the name of the task), `context` and `question`. The server writes out JSON objects containing `id` and
`answer`. The server listens on port 8401 by default; use `--port` to specify a different port or `--stdin` to
use standard input/output instead of TCP.
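A client for this protocol might look like the following sketch. The field names and the default port come from the description above; the function names are hypothetical, and the framing (one JSON object per line, newline-terminated) is an assumption that you should verify against your server version.

```python
import json
import socket


def build_request(request_id, task, context, question):
    """Serialize one request as a single newline-terminated JSON line."""
    payload = {"id": request_id, "task": task, "context": context, "question": question}
    return json.dumps(payload) + "\n"


def query_server(request_line, host="localhost", port=8401, timeout=10.0):
    """Send one request to a running `genienlp server` and return the parsed reply.

    Assumption: the server answers with one JSON object per line.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request_line.encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reader:
            return json.loads(reader.readline())
```

With a server running locally, `query_server(build_request("1", "almond", "<context>", "<question>"))` would return a dictionary that contains the `id` and `answer` fields.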
### Calibrating a trained model
Calibrate the confidence scores of a trained model:
1. Calculate and save confidence features of the evaluation set in a pickle file:
```bash
genienlp predict --tasks almond --data <datadir> --path <model_dir> --save_confidence_features --confidence_feature_path <confidence_feature_file>
```
2. Train a boosted tree to map confidence features to a score between 0 and 1:
```bash
genienlp calibrate --confidence_path <confidence_feature_file> --save <calibrator_directory> --name_prefix <calibrator_name>
```
3. If you now pass `--calibrator_paths` during prediction, the model outputs a confidence score for each prediction:
```bash
genienlp predict --tasks almond --data <datadir> --path <model_dir> --calibrator_paths <calibrator_directory>/<calibrator_name>.calib
```
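The three steps share several paths, so it can help to assemble them in one place. The sketch below only builds and prints the command lines, using the flags shown above (with `--tasks`, as in the other `predict` commands); the function name and the example paths are hypothetical.

```python
import shlex


def calibration_commands(datadir, model_dir, feature_file, calib_dir, calib_name, task="almond"):
    """Return the argv lists for the three calibration steps, in order."""
    calibrator = f"{calib_dir}/{calib_name}.calib"
    return [
        # Step 1: dump confidence features for the evaluation set.
        ["genienlp", "predict", "--tasks", task, "--data", datadir, "--path", model_dir,
         "--save_confidence_features", "--confidence_feature_path", feature_file],
        # Step 2: train the calibrator on those features.
        ["genienlp", "calibrate", "--confidence_path", feature_file,
         "--save", calib_dir, "--name_prefix", calib_name],
        # Step 3: predict again, this time with calibrated confidence scores.
        ["genienlp", "predict", "--tasks", task, "--data", datadir, "--path", model_dir,
         "--calibrator_paths", calibrator],
    ]


# Example paths are placeholders, not defaults of the tool.
commands = calibration_commands("data", "model", "features.pkl", "calib", "my_calibrator")
for argv in commands:
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` to execute the steps in order.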
### Paraphrasing
Train a paraphrasing model:
```bash
genienlp train-paraphrase --train_data_file <train_data_file> --eval_data_file <dev_data_file> --output_dir <model_dir> \
--model_type gpt2 --do_train --do_eval --evaluate_during_training --logging_steps 1000 --save_steps 1000 \
--max_steps 40000 --save_total_limit 2 --gradient_accumulation_steps 16 --per_gpu_eval_batch_size 4 \
--per_gpu_train_batch_size 4 --num_train_epochs 1 --model_name_or_path <gpt2/gpt2-medium/gpt2-large/gpt2-xl>
```
Generate paraphrases:
```bash
genienlp run-paraphrase --model_name_or_path <model_dir> --temperature 0.3 --repetition_penalty 1.0 --num_samples 4 --batch_size 32 --input_file <input_tsv_file> --input_column 1
```
### Translation
Use the following command to train or fine-tune an NMT model:
```bash
genienlp train --train_tasks almond_translate --data <data_directory> --train_languages <src_lang> --eval_languages <tgt_lang> --no_commit --train_iterations <iterations> --preserve_case --save <save_dir> --exist_ok --skip_cache --model TransformerSeq2Seq --pretrained_model <hf_model_name>
```
We currently support MarianMT, MBART, MT5, and M2M100 models.<br>
To save a pretrained model in genienlp format without any fine-tuning, set `--train_iterations` to 0. You can then use this model for inference.
To produce translations for an eval or test set, run the following command:
```bash
genienlp predict --tasks almond_translate --data <data_directory> --pred_languages <src_lang> --pred_tgt_languages <tgt_lang> --path <path_to_saved_model> --eval_dir <eval_dir> --skip_cache --val_batch_size 4000 --evaluate <valid/test> --overwrite --silent
```
If your dataset is a document or contains long examples, pass `--translate_example_split` to split the examples into individual sentences before translation for better results. <br>
To use alignment as described in our localization paper (cited below), pass `--replace_qp` and `--force_replace_qp`, which ensure that the parameters between quotation marks in the sentence are preserved in the output.
The alignment code has been updated and improved since the 0.6.0 release, so if you wish to compare with the published results, use genienlp <= 0.6.0. However, we recommend the newer version for higher translation quality.
### Named Entity Disambiguation
First, run a Bootleg model to extract mentions, entity candidates, and contextual embeddings for the mentions:
```bash
genienlp bootleg-dump-features --train_tasks <train_task_names> --save <savedir> --preserve_case --data <dataset_dir> --train_batch_tokens 1200 --val_batch_size 2000 --database_type json --database_dir <database_dir> --min_entity_len 1 --max_entity_len 4 --bootleg_model <bootleg_model>
```
This command generates several output files. In `<dataset_dir>` you should see a `prep` directory that contains the preprocessed data (e.g. data converted to memory-mapped format, several arrays to facilitate embedding lookup, etc.). If your dataset doesn't change, you can reuse the same files.
It will also generate several files in the `<results_temp>` folder. In `eval_bootleg/[train|eval]/<bootleg_model>/bootleg_labels.jsonl` you can see the examples, mentions, predicted candidates, and their probabilities according to Bootleg.
Now you can use the extracted features from bootleg in downstream tasks such as semantic parsing to improve named entity understanding and consequently generation:
```bash
genienlp train --train_tasks <train_task_names> --train_iterations <iterations> --preserve_case --save <savedir> --data <dataset_dir> --model TransformerSeq2Seq --pretrained_model facebook/bart-base --train_batch_tokens 1000 --val_batch_size 1000
```