# QANTA
## Downloading Data
Whether you would like to use our system or only our dataset, the easiest way to do so is
to use our `dataset.py` script. It is a standalone script whose only dependencies are Python 3.6 and the package `click`,
which can be installed via `pip install click`.
The following commands can be used to download our dataset, or the datasets we use in either the system or the paper plots.
Data will be downloaded to `data/external/datasets` by default, but this can be changed with the `--local-qanta-prefix`
option (see the example below).
* `./dataset.py download`: Download only the qanta dataset
* `./dataset.py download wikidata`: Download our preprocessed wikidata.org `instance of` attributes
* `./dataset.py download plotting`: Download the SQuAD, SimpleQuestions, Jeopardy!, and TriviaQA datasets we
compare against in our paper plots and tables
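For example, to fetch the core dataset into a non-default directory (the path below is illustrative, and the exact flag placement may differ; run `./dataset.py --help` to check):

```bash
# Download the qanta dataset into a custom directory instead of data/external/datasets
$ ./dataset.py download --local-qanta-prefix data/my-qanta-copy/
```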
### File Descriptions
* `qanta.unmapped.2018.04.18.json`: All questions in our dataset, without mapped Wikipedia answers. Sourced from
Protobowl and QuizDB. Light preprocessing has been applied to remove quiz bowl-specific syntax, such as instructions
to moderators.
* `qanta.processed.2018.04.18.json`: The prior dataset with added fields containing the first sentence and sentence tokenizations
of each question paragraph, for convenience.
* `qanta.mapped.2018.04.18.json`: The processed dataset with Wikipedia pages matched to the answer where possible. This
includes all questions, even those without matched pages (see the example below for a quick way to inspect this file).
* `qanta.2018.04.18.sqlite3`: Equivalent to `qanta.mapped.2018.04.18.json` but in sqlite3 format
* `qanta.train.2018.04.18.json`: Training data, which is the mapped dataset filtered down to only questions with non-null
page matches
* `qanta.dev.2018.04.18.json`: Dev data, which is the mapped dataset filtered down to only questions with non-null
page matches
* `qanta.test.2018.04.18.json`: Test data, which is the mapped dataset filtered down to only questions with non-null
page matches
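As a quick sanity check after downloading, you can peek at the mapped file. This is a minimal sketch assuming `jq` is installed, the default download location, and a top-level `questions` array; the fields shown are illustrative, so inspect a full record to see the actual schema:

```bash
# Count the questions in the mapped dataset
$ jq '.questions | length' data/external/datasets/qanta.mapped.2018.04.18.json
# Print the first record to see which fields are available
$ jq '.questions[0]' data/external/datasets/qanta.mapped.2018.04.18.json
```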
## Dependencies
Install all necessary Python packages into a virtual environment by running `poetry install` in the qanta directory. Further qanta setup that requires Python dependencies should be performed in the virtual environment.
The virtual environment can be accessed by running `poetry shell`.
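For example:

```bash
# Install Python dependencies into a poetry-managed virtual environment
$ poetry install
# Enter the virtual environment; run subsequent qanta commands from this shell
$ poetry shell
```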
### NLTK Models
```bash
# Download nltk data
$ python3 nltk_setup.py
```
### Installing Elasticsearch 5.6
(only needed for the Elasticsearch guesser)
```bash
$ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.2.tar.gz
$ tar -xvf elasticsearch-5.6.2.tar.gz
```
Install version 5.6.X; do not use 6.X. Also be sure that the `bin/` directory within the extracted files is in your
`$PATH`, as it contains the required `elasticsearch` binary (see the example below).
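For example, after extracting the archive in the current directory (adjust the path if you extracted it elsewhere):

```bash
# Put the bundled Elasticsearch binaries on PATH for this shell session
$ export PATH="$PWD/elasticsearch-5.6.2/bin:$PATH"
# Verify that the binary is found
$ which elasticsearch
```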
### Qanta on Path
In addition to these steps, you need to include the qanta directory in your `PYTHONPATH` environment variable, as shown below. We intend to fix these path issues in the future by fixing the absolute/relative paths.
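For example, from the root of the qanta repository:

```bash
# Make the qanta package importable from anywhere in this shell session
$ export PYTHONPATH="$PWD:$PYTHONPATH"
```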
## Configuration
QANTA configuration is done through a combination of environment variables and
the `qanta-defaults.yaml`/`qanta.yaml` files. QANTA will read a `qanta.yaml`
first if it exists, otherwise it will fall back to reading
`qanta-defaults.yaml`. This is meant to allow for custom configuration of
`qanta.yaml` after copying it via `cp qanta-defaults.yaml qanta.yaml`.
The configuration of most interest is how to enable or disable specific guesser
implementations. In the `guesser` config the keys such as
`qanta.guesser.dan.DanGuesser` correspond to the fully qualified paths of each
guesser. Each of these keys contains an array of configurations (this is
signified in YAML by the `-`). Our code will inspect all of these
configurations looking for those that have `enabled: true`, and only run those
guessers. By default we have `enabled: false` for all models. If you simply
want to perform a sanity check, we recommend enabling
`qanta.guesser.tfidf.TfidfGuesser`. If you are looking for our best model and
configuration, you should enable `qanta.guesser.rnn.RnnGuesser`.
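A minimal workflow for the sanity-check setup might look like the following (the exact layout of the guesser section in the YAML file may differ slightly from this sketch):

```bash
# Create a local config that overrides the defaults
$ cp qanta-defaults.yaml qanta.yaml
# Then edit qanta.yaml and, under the guesser key
# qanta.guesser.tfidf.TfidfGuesser, change `enabled: false` to `enabled: true`
```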
## Running QANTA
Running qanta is managed primarily by two methods: `./cli.py` and
[Luigi](https://github.com/spotify/luigi). The former is used to run specific
commands such as starting/stopping Elasticsearch, but in general `luigi` is
the primary method for running our system.
### Luigi Pipelines
Luigi is a pure-Python, make-like framework for running data pipelines. Below we
give sample commands for running different parts of our pipeline. In general,
you should either append `--local-scheduler` to all commands or learn about
using the [Luigi Central
Scheduler](https://luigi.readthedocs.io/en/stable/central_scheduler.html).
For these common tasks, you can use the command `luigi --local-scheduler` followed by one of the options below (full example commands appear after this list):
* `--module qanta.pipeline.preprocess DownloadData`: This downloads any
necessary data and preprocesses it. This will download a copy of our
preprocessed Wikipedia stored in AWS S3 and turn it into the format used by our
code. This step requires the AWS CLI, `lz4`, Apache Spark, and may require a
decent amount of RAM.
* `--module qanta.pipeline.guesser AllGuesserReports`: Train all enabled
guessers, generate guesses for them, and produce a report of their
performance into `output/guesser`.
Certain tasks might require spaCy models (e.g., `en_core_web_lg`) or NLTK data
(e.g., `wordnet`) to be downloaded. See the [FAQ](#debugging-faq-and-solutions)
section for more information.
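For example, the two tasks above can be run as:

```bash
# Download and preprocess the data
$ luigi --local-scheduler --module qanta.pipeline.preprocess DownloadData
# Train all enabled guessers, generate guesses, and write reports to output/guesser
$ luigi --local-scheduler --module qanta.pipeline.guesser AllGuesserReports
```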
### Qanta CLI
You can start/stop elastic search with
* `./cli.py elasticsearch start`
* `./cli.py elasticsearch stop`
## AWS S3 Checkpoint/Restore
To provide an easy way to version, checkpoint, and restore runs of qanta, we provide a script to
manage that at `aws_checkpoint.py`. We assume that you set the environment variable
`QB_AWS_S3_BUCKET` to the bucket you want to checkpoint to and restore from. The script assumes full
access to all contents of the bucket, so we suggest creating a dedicated bucket.
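For example (the bucket name below is illustrative; see the script itself for its available commands):

```bash
# Point qanta at a dedicated S3 bucket for checkpoint/restore
$ export QB_AWS_S3_BUCKET=my-qanta-checkpoints
```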
## Information on our data sources
### Wikipedia Dumps
As part of our ingestion pipeline we access raw Wikipedia dumps. The current code is based on the English Wikipedia
dumps created on 2017/04/01 available at https://dumps.wikimedia.org/enwiki/20170401/
Of these, we use the following (you may need to use more recent dumps):
* [Wikipedia page text](https://dumps.wikimedia.org/enwiki/20170401/enwiki-20170401-pages-articles-multistream.xml.bz2): This is used to get the text, title, and id of Wikipedia pages
* [Wikipedia titles](https://dumps.wikimedia.org/enwiki/20170401/enwiki-20170401-all-titles.gz): This is used for more convenient access to Wikipedia page titles
* [Wikipedia redirects](https://dumps.wikimedia.org/enwiki/20170401/enwiki-20170401-redirect.sql.gz): DB dump of Wikipedia redirects, used for resolving different ways of referencing the same Wikipedia entity
* [Wikipedia page to ids](https://dumps.wikimedia.org/enwiki/20170401/enwiki-20170401-page.sql.gz): Contains a mapping of Wikipedia pages to ids, necessary for making the redirect table useful
To process Wikipedia, we use [https://github.com/attardi/wikiextractor](https://github.com/attardi/wikiextractor)
with the following command:
```bash
$ WikiExtractor.py --processes 15 -o parsed-wiki --json enwiki-20170401-pages-articles-multistream.xml.bz2
```
Do not use the flag to filter disambiguation pages: it uses a simple string regex to check the title and article contents, which introduces both false positives and false negatives. We handle filtering these out by using the Wikipedia categories dump instead.
Afterwards, we use the following command to tar the output and compress it with lz4 before uploading the archive to S3:
```bash
tar cvf - parsed-wiki | lz4 - parsed-wiki.tar.lz4
```
#### Wikipedia Redirect Mapping Creation
The output of this process is stored in `s3://pinafore-us-west-2/public/wiki_redirects.csv`
All the Wikipedia database dumps are provided as MySQL SQL files, so a working MySQL installation is necessary to load and use them.