# QANTA
## Downloading Data
Whether you would like to use our system or only our dataset, the easiest way to do so is
to use our `dataset.py` script. It is a standalone script whose only dependencies are Python 3.6 and the package `click`,
which can be installed via `pip install click`.
The following commands can be used to download our dataset, or the datasets we use in either the system or the paper plots.
Data will be downloaded to `data/external/datasets` by default, but this can be changed with the `--local-qanta-prefix`
option (see the example below).
* `./dataset.py download`: Download only the qanta dataset
* `./dataset.py download wikidata`: Download our preprocessed wikidata.org `instance of` attributes
* `./dataset.py download plotting`: Download the SQuAD, SimpleQuestions, Jeopardy!, and TriviaQA datasets we
compare against in our paper plots and tables
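For example, to fetch the core dataset into a non-default directory (the path below is illustrative, and the exact flag placement may differ; run `./dataset.py --help` to check):

```bash
# Download the qanta dataset into a custom directory instead of data/external/datasets
$ ./dataset.py download --local-qanta-prefix data/my-qanta-copy/
```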
### File Descriptions
* `qanta.unmapped.2018.04.18.json`: All questions in our dataset, without mapped Wikipedia answers. Sourced from
Protobowl and QuizDB. Light preprocessing has been applied to remove quiz bowl-specific syntax, such as instructions
to moderators.
* `qanta.processed.2018.04.18.json`: The prior dataset with added fields containing the first sentence and sentence tokenizations
of each question paragraph, for convenience.
* `qanta.mapped.2018.04.18.json`: The processed dataset with Wikipedia pages matched to the answer where possible. This
includes all questions, even those without matched pages (see the example below for a quick way to inspect this file).
* `qanta.2018.04.18.sqlite3`: Equivalent to `qanta.mapped.2018.04.18.json` but in sqlite3 format
* `qanta.train.2018.04.18.json`: Training data, which is the mapped dataset filtered down to only questions with non-null
page matches
* `qanta.dev.2018.04.18.json`: Dev data, which is the mapped dataset filtered down to only questions with non-null
page matches
* `qanta.test.2018.04.18.json`: Test data, which is the mapped dataset filtered down to only questions with non-null
page matches
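As a quick sanity check after downloading, you can peek at the mapped file. This is a minimal sketch assuming `jq` is installed, the default download location, and a top-level `questions` array; the fields shown are illustrative, so inspect a full record to see the actual schema:

```bash
# Count the questions in the mapped dataset
$ jq '.questions | length' data/external/datasets/qanta.mapped.2018.04.18.json
# Print the first record to see which fields are available
$ jq '.questions[0]' data/external/datasets/qanta.mapped.2018.04.18.json
```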
## Dependencies
Install all necessary Python packages into a virtual environment by running `poetry install` in the qanta directory. Further qanta setup that requires Python dependencies should be performed in the virtual environment.
The virtual environment can be accessed by running `poetry shell`.
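For example:

```bash
# Install Python dependencies into a poetry-managed virtual environment
$ poetry install
# Enter the virtual environment; run subsequent qanta commands from this shell
$ poetry shell
```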
### NLTK Models
```bash
# Download nltk data
$ python3 nltk_setup.py
```
### Installing Elasticsearch 5.6
(only needed for the Elasticsearch guesser)
```bash
$ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.2.tar.gz
$ tar -xvf elasticsearch-5.6.2.tar.gz
```
Install version 5.6.X; do not use 6.X. Also be sure that the `bin/` directory within the extracted files is in your
`$PATH`, as it contains the required `elasticsearch` binary (see the example below).
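For example, after extracting the archive in the current directory (adjust the path if you extracted it elsewhere):

```bash
# Put the bundled Elasticsearch binaries on PATH for this shell session
$ export PATH="$PWD/elasticsearch-5.6.2/bin:$PATH"
# Verify that the binary is found
$ which elasticsearch
```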
### Qanta on Path
In addition to these steps, you need to include the qanta directory in your `PYTHONPATH` environment variable, as shown below. We intend to fix these path issues in the future by fixing the absolute/relative paths.
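For example, from the root of the qanta repository:

```bash
# Make the qanta package importable from anywhere in this shell session
$ export PYTHONPATH="$PWD:$PYTHONPATH"
```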
## Configuration
QANTA configuration is done through a combination of environment variables and
the `qanta-defaults.yaml`/`qanta.yaml` files. QANTA will read a `qanta.yaml`
first if it exists, otherwise it will fall back to reading
`qanta-defaults.yaml`. This is meant to allow for custom configuration of
`qanta.yaml` after copying it via `cp qanta-defaults.yaml qanta.yaml`.
The configuration of most interest is how to enable or disable specific guesser
implementations. In the `guesser` config the keys such as
`qanta.guesser.dan.DanGuesser` correspond to the fully qualified paths of each
guesser. Each of these keys contains an array of configurations (this is
signified in YAML by the `-`). Our code will inspect all of these
configurations looking for those that have `enabled: true`, and only run those
guessers. By default we have `enabled: false` for all models. If you simply
want to perform a sanity check, we recommend enabling
`qanta.guesser.tfidf.TfidfGuesser`. If you are looking for our best model and
configuration, you should enable `qanta.guesser.rnn.RnnGuesser`.
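A minimal workflow for the sanity-check setup might look like the following (the exact layout of the guesser section in the YAML file may differ slightly from this sketch):

```bash
# Create a local config that overrides the defaults
$ cp qanta-defaults.yaml qanta.yaml
# Then edit qanta.yaml and, under the guesser key
# qanta.guesser.tfidf.TfidfGuesser, change `enabled: false` to `enabled: true`
```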
## Running QANTA
Running qanta is managed primarily by two methods: `./cli.py` and
[Luigi](https://github.com/spotify/luigi). The former is used to run specific
commands such as starting/stopping Elasticsearch, but in general `luigi` is
the primary method for running our system.
### Luigi Pipelines
Luigi is a pure-Python, make-like framework for running data pipelines. Below we
give sample commands for running different parts of our pipeline. In general,
you should either append `--local-scheduler` to all commands or learn about
using the [Luigi Central
Scheduler](https://luigi.readthedocs.io/en/stable/central_scheduler.html).
For these common tasks, you can use the command `luigi --local-scheduler` followed by one of the options below (full example commands appear after this list):
* `--module qanta.pipeline.preprocess DownloadData`: This downloads any
necessary data and preprocesses it. This will download a copy of our
preprocessed Wikipedia stored in AWS S3 and turn it into the format used by our
code. This step requires the AWS CLI, `lz4`, Apache Spark, and may require a
decent amount of RAM.
* `--module qanta.pipeline.guesser AllGuesserReports`: Train all enabled
guessers, generate guesses for them, and produce a report of their
performance into `output/guesser`.
Certain tasks might require spaCy models (e.g., `en_core_web_lg`) or NLTK data
(e.g., `wordnet`) to be downloaded. See the [FAQ](#debugging-faq-and-solutions)
section for more information.
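For example, the two tasks above can be run as:

```bash
# Download and preprocess the data
$ luigi --local-scheduler --module qanta.pipeline.preprocess DownloadData
# Train all enabled guessers, generate guesses, and write reports to output/guesser
$ luigi --local-scheduler --module qanta.pipeline.guesser AllGuesserReports
```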
### Qanta CLI
You can start/stop elastic search with
* `./cli.py elasticsearch start`
* `./cli.py elasticsearch stop`
## AWS S3 Checkpoint/Restore
To provide an easy way to version, checkpoint, and restore runs of qanta, we provide a script to
manage that at `aws_checkpoint.py`. We assume that you set the environment variable
`QB_AWS_S3_BUCKET` to the bucket you want to checkpoint to and restore from. The script assumes full
access to all contents of the bucket, so we suggest creating a dedicated bucket.
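For example (the bucket name below is illustrative; see the script itself for its available commands):

```bash
# Point qanta at a dedicated S3 bucket for checkpoint/restore
$ export QB_AWS_S3_BUCKET=my-qanta-checkpoints
```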
## Information on our data sources
### Wikipedia Dumps
As part of our ingestion pipeline we access raw Wikipedia dumps. The current code is based on the English Wikipedia
dumps created on 2017/04/01 available at https://dumps.wikimedia.org/enwiki/20170401/
Of these, we use the following (you may need to use more recent dumps):
* [Wikipedia page text](https://dumps.wikimedia.org/enwiki/20170401/enwiki-20170401-pages-articles-multistream.xml.bz2): This is used to get the text, title, and id of Wikipedia pages
* [Wikipedia titles](https://dumps.wikimedia.org/enwiki/20170401/enwiki-20170401-all-titles.gz): This is used for more convenient access to Wikipedia page titles
* [Wikipedia redirects](https://dumps.wikimedia.org/enwiki/20170401/enwiki-20170401-redirect.sql.gz): DB dump of Wikipedia redirects, used for resolving different ways of referencing the same Wikipedia entity
* [Wikipedia page to ids](https://dumps.wikimedia.org/enwiki/20170401/enwiki-20170401-page.sql.gz): Contains a mapping of Wikipedia pages to ids, necessary for making the redirect table useful
To process Wikipedia, we use [https://github.com/attardi/wikiextractor](https://github.com/attardi/wikiextractor)
with the following command:
```bash
$ WikiExtractor.py --processes 15 -o parsed-wiki --json enwiki-20170401-pages-articles-multistream.xml.bz2
```
Do not use the flag to filter disambiguation pages: it uses a simple string regex to check the title and article contents, which introduces both false positives and false negatives. We handle filtering these out by using the Wikipedia categories dump instead.
Afterwards, we use the following command to tar the output and compress it with lz4 before uploading the archive to S3:
```bash
tar cvf - parsed-wiki | lz4 - parsed-wiki.tar.lz4
```
#### Wikipedia Redirect Mapping Creation
The output of this process is stored in `s3://pinafore-us-west-2/public/wiki_redirects.csv`
All the Wikipedia database dumps are provided as MySQL SQL files, so a working MySQL installation is necessary to load and use them.