# COVID-19 Document Classification
This repo provides a platform for testing document classification models on COVID-19 literature.
It is an extension of the [Hedwig](https://github.com/castorini/hedwig) library and contains all necessary code to reproduce the results of some document classification models on a COVID-19 dataset created from the [LitCovid](https://www.ncbi.nlm.nih.gov/research/coronavirus/) collection. More information about the models tested and experiments carried out can be found [here]().
The Hedwig library was modified to work with a newer version of PyTorch and the Transformers library in order to import custom models. It was also extended to adapt [DocBERT](https://arxiv.org/abs/1904.08398) to use the Longformer model.
## Data
### LitCovid Data
The LitCovid document classification dataset found under the following directory can be used to reproduce the results found in the paper.
```
hedwig-data/datasets/LitCovid/
```
You can find a Hedwig-compatible version (`train.tsv`) and a raw version (`LitCovid.train.csv`) that includes the PMID for each article.
We have also included a script that downloads the most up-to-date version of the LitCovid dataset. Run the following commands:
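In the Hedwig TSV convention, each line of `train.tsv` holds a one-hot label string followed by the document text, separated by a tab. A minimal sketch of reading that format with the standard library (the label string, its category order, and the example rows below are hypothetical, not taken from the real dataset):

```python
import csv
import io

# Two made-up rows in the assumed format: one-hot labels <TAB> document text.
sample = (
    "0100000\tRemdesivir shows efficacy in early trials ...\n"
    "1000001\tA case report describing transmission dynamics ...\n"
)

# QUOTE_NONE because the raw text may contain stray quote characters.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

for labels, text in rows:
    # Indices of the active (1) label positions for this document.
    active = [i for i, flag in enumerate(labels) if flag == "1"]
    print(active, text[:40])
```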
```bash
cd scripts
bash load_litcovid.sh
```
This script downloads the data, processes it, and saves both the processed and raw versions into the `data/FullLitCovid` directory. Please move the files to `hedwig-data/datasets/LitCovid` for further training.
Note that this freshly downloaded data will not maintain the train/dev/test split used in the paper.
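Because the freshly downloaded data arrives unsplit, you will need to create your own partitions before training. One possible sketch with a fixed seed for reproducibility (the 80/10/10 fractions and seed are arbitrary choices, not the split used in the paper):

```python
import random

def split_dataset(rows, seed=42, train_frac=0.8, dev_frac=0.1):
    """Shuffle rows deterministically and split into train/dev/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_dev = int(len(rows) * dev_frac)
    return (
        rows[:n_train],
        rows[n_train:n_train + n_dev],
        rows[n_train + n_dev:],
    )

train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```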
### Word Embeddings
Along with the dataset, you must download the word2vec embeddings for the traditional deep learning models.
Follow the instructions below (taken from the Hedwig repo) to download the embeddings. The only files needed for our purposes are under:
```
hedwig-data/embeddings
```
Option 1. Our [Wasabi](https://wasabi.com/)-hosted mirror:
```bash
$ wget http://nlp.rocks/hedwig -O hedwig-data.zip
$ unzip hedwig-data.zip
```
Option 2. Our school-hosted repository, [`hedwig-data`](https://git.uwaterloo.ca/jimmylin/hedwig-data):
```bash
$ git clone https://github.com/castorini/hedwig.git
$ git clone https://git.uwaterloo.ca/jimmylin/hedwig-data.git
```
Next, organize your directory structure as follows:
```
.
├── hedwig
└── hedwig-data
```
After cloning the hedwig-data repo, you need to unzip the embeddings and run the preprocessing script:
```bash
cd hedwig-data/embeddings/word2vec
tar -xvzf GoogleNews-vectors-negative300.tgz
```
### Pre-Trained Longformer
Finally, you'll need the pre-trained Longformer model.
Run the following commands to download the pre-trained Longformer from their [original repo](https://github.com/allenai/longformer).
```bash
cd hedwig
wget https://ai2-s2-research.s3-us-west-2.amazonaws.com/longformer/longformer-base-4096.tar.gz
tar -xvzf longformer-base-4096.tar.gz
```
### CORD-19 Test Dataset
The full articles, PMIDs and labels for the CORD-19 test dataset can be found under `data/cord19_annotations.tsv`.
The Jupyter notebook `scripts/clean_cord19_dataset.ipynb` creates the Hedwig-compatible version found in `hedwig-data/datasets/LitCovid/cord19_test.tsv`.
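The notebook's core conversion step is turning a multi-label annotation into Hedwig's one-hot label string. A sketch of that transformation (the category list, its order, and the `;`-separated label format are assumptions for illustration; check the notebook for the actual labels):

```python
# Hypothetical category order; the real order is defined in the notebook.
CATEGORIES = [
    "Treatment", "Diagnosis", "Prevention", "Mechanism",
    "Transmission", "Epidemic Forecasting", "Case Report",
]

def to_one_hot(labels: str) -> str:
    """Convert a ';'-separated label annotation to a one-hot string."""
    present = set(labels.split(";"))
    return "".join("1" if c in present else "0" for c in CATEGORIES)

print(to_one_hot("Treatment;Prevention"))  # 1010000
```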
## Requirements
We recommend creating a dedicated virtual environment and installing all requirements via pip:
```
$ pip install -r requirements.txt
```
The code depends on data from NLTK (e.g., stopwords), so you'll need to download it.
Run the Python interpreter and enter the following commands:
```python
>>> import nltk
>>> nltk.download()
```
## Model Training
Run the following commands in the `hedwig` directory to train the transformer-based models and the conventional deep learning models, respectively:
```bash
bash run_all_bert.sh
bash run_all_deep.sh
```
To reproduce the results for traditional machine learning models on the LitCovid dataset, run the Jupyter notebook `scripts/traditional_models.ipynb`.
To reproduce the results on the CORD-19 test dataset run the following:
```bash
bash run_baselines_cord19.sh
```
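When comparing models across runs, a micro-averaged F1 over the binary label matrices is a common choice for this kind of multi-label task (whether the scripts report exactly this metric is an assumption). A dependency-free sketch:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over parallel binary label matrices
    (lists of 0/1 rows, one row per document)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += t and p            # predicted 1, actually 1
            fp += (not t) and p      # predicted 1, actually 0
            fn += t and (not p)      # predicted 0, actually 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(micro_f1([[1, 0, 1]], [[1, 1, 0]]))  # 0.5
```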