# PretrainGNNs
[中文版本](./README_cn.md) [English Version](./README.md)
* [Background](#background)
* [Instructions](#instructions)
* [How to get?](#how-to-get)
* [Model link](#model-link)
* [Data link](#data-link)
* [Training Models](#training-models)
* [Finetuning Models](#finetuning-models)
* [GNN Models](#gnn-models)
* [GIN](#gin)
* [GAT](#gat)
* [GCN](#gcn)
* [Other Parameters](#other-parameters)
* [Compound Related Tasks](#compound-related-tasks)
* [Pretraining Tasks](#pretraining-tasks)
* [Pre-training datasets](#pre-training-datasets)
* [Node-level](#node-level)
* [Graph-level](#graph-level)
* [Downstream Tasks](#downstream-tasks)
* [Chemical molecular properties prediction](#chemical-molecular-properties-prediction)
* [Downstream classification datasets](#downstream-classification-datasets)
* [Fine-tuning](#fine-tuning)
* [Evaluation results](#evaluation-results)
* [Data](#data)
* [How to get?](#how-to-get-1)
* [Datasets introduction](#datasets-introduction)
* [Reference](#reference)
* [Paper-related](#paper-related)
* [Data-related](#data-related)
## Background
In recent years, deep learning has achieved strong results in many fields, but its application to molecular informatics and drug development is still limited. Drug development is an expensive and time-consuming process, and the compound screening stage in particular needs efficiency improvements. Early work used traditional machine learning methods to predict physical and chemical properties. Molecules are naturally represented as graphs: graphs have irregular shapes and sizes, their nodes have no spatial order, and a node's neighborhood is determined by structure rather than position. Treating molecular structures as graphs has therefore drawn growing attention to graph neural networks. In practice, however, model performance is limited by scarce labels and by distribution shift between the training and test sets.
To mitigate these issues, we implement the models described in the paper "Strategies for Pre-training Graph Neural Networks". The model is first pre-trained on data-rich related tasks, at both the node level and the graph level, and then fine-tuned on downstream tasks. As for the implementation, we provide GIN, GAT, GCN, and other models.
## Instructions
### How to get?
#### Model link
You can download the [pretrained models](https://baidu-nlp.bj.bcebos.com/PaddleHelix/pretrained_models/compound/pregnn-attrmask-supervised.zip) or train them yourself.
#### Data link
You can download the dataset from the [link](http://snap.stanford.edu/gnn-pretrain/data/chem_dataset.zip) we provide and perform the corresponding preprocessing before use. We recommend unzipping the dataset into the `data` folder under the root directory; if that folder does not exist, create it first.
```bash
# cd to PaddleHelix folder
mkdir -p data
cd data
wget http://snap.stanford.edu/gnn-pretrain/data/chem_dataset.zip
unzip chem_dataset.zip
```
### Training Models
The pre-training strategies we provide fall into two levels. The first is node-level pre-training, for which there are two methods; the second is a supervised pre-training strategy for the whole graph. You can combine them in your experiments: perform node-level pre-training first, then graph-level pre-training on top of it, as follows:
![diagram](./imgs/pregnn.png)
The corresponding training files are:
```
pretrain_attrmask.py # Node-level attribute masking pre-training file
pretrain_supervised.py # Pre-training files at the entire graph level
```
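To make the node-level strategy concrete, attribute masking corrupts a fraction of the node features and trains the model to recover them. Below is a minimal NumPy sketch of the idea; all names are illustrative and this is not the code in `pretrain_attrmask.py`:
```python
import numpy as np

def mask_node_attributes(node_feats, mask_rate=0.15, mask_token=0):
    """Randomly mask a fraction of the nodes' attributes. The GNN encodes
    the corrupted graph, and a small head is trained to predict the
    original attributes of the masked nodes."""
    num_nodes = node_feats.shape[0]
    num_mask = max(1, int(num_nodes * mask_rate))
    masked_idx = np.random.choice(num_nodes, num_mask, replace=False)

    labels = node_feats[masked_idx].copy()   # prediction targets
    corrupted = node_feats.copy()
    corrupted[masked_idx] = mask_token       # overwrite with the mask token
    return corrupted, masked_idx, labels
```
The GNN then encodes the corrupted graph, and a prediction head is trained with a cross-entropy loss to recover `labels` at the positions in `masked_idx`.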
Using `pretrain_attrmask.py` as an example, the model parameters are:
- `batch_size` : Batch size of the model; the default value for training is 256.
- `max_epoch` : Maximum number of epochs to train the model, which you can choose according to your compute power (training on a single Tesla V100 takes about 11 minutes per epoch).
- `data_path` : The path the data is loaded from. First download the dataset from the link we provide; we recommend unzipping it into the `data` folder under the root directory (create the folder if it does not exist).
- `init_model` : The model used for initialization, i.e., a model without the pre-training strategy applied; our pre-trained model is available at this [address](https://baidu-nlp.bj.bcebos.com/PaddleHelix/pretrained_models/compound/pregnn-attrmask-supervised.zip).
- `compound_encoder_config` : The path of the compound encoder config file, containing the parameters of the GNN model.
- `model_config` : The path of the model config file, containing the parameters of the pre-training model.
- `dropout_rate` : The dropout rate; here we use 0.2.
- `model_dir` : The path to save the model.
```bash
CUDA_VISIBLE_DEVICES=0 python pretrain_attrmask.py \
--batch_size=256 \
--num_workers=2 \
--max_epoch=100 \
--lr=1e-3 \
--dropout_rate=0.2 \
--data_path=../../../data/chem_dataset/zinc_standard_agent \
--compound_encoder_config=model_configs/pregnn_paper.json \
--model_config=model_configs/pre_Attrmask.json \
--model_dir=../../../output/pretrain_gnns/pretrain_attrmask
```
We provide shell scripts that run the Python files directly; you can adjust the parameters in the scripts.
```bash
sh scripts/pretrain_attrmask.sh #run pretrain_attrmask.py with given parameters
sh scripts/pretrain_supervised.sh #run pretrain_supervised.py with given parameters
```
### Finetuning Models
Fine-tuning the model is similar to training it, and the parameters are defined in the same way. The init model is the [pre-trained model](https://baidu-nlp.bj.bcebos.com/PaddleHelix/pretrained_models/compound/pregnn-attrmask-supervised.zip) we downloaded before; you can put it in the corresponding folder.
```bash
CUDA_VISIBLE_DEVICES=0 python finetune.py \
--batch_size=32 \
--max_epoch=4 \
--dataset_name=tox21 \
--data_path=../../../data/chem_dataset/tox21 \
--compound_encoder_config=model_configs/pregnn_paper.json \
--model_config=model_configs/down_linear.json \
--init_model=../../../output/pretrain_gnns/pregnn_paper-pre_Attrmask-pre_Supervised/epoch40/compound_encoder.pdparams \
--model_dir=../../../output/pretrain_gnns/finetune/tox21 \
--encoder_lr=1e-3 \
--head_lr=1e-3 \
--dropout_rate=0.2
```
We provide a shell script to run the fine-tuning file; you can adjust the parameters in the script.
```bash
sh scripts/finetune.sh #run finetune.py with given parameters
```
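For reference, restoring the pre-trained encoder during fine-tuning follows the standard PaddlePaddle load pattern. Below is a minimal sketch, assuming `paddle` is installed; the `Linear` layer is only a stand-in for the real compound encoder, which `finetune.py` builds from `compound_encoder_config`:
```python
import paddle

# Stand-in for the compound encoder; finetune.py builds the real GNN
# encoder from the --compound_encoder_config file.
encoder = paddle.nn.Linear(300, 300)

# Load the parameters saved by the pre-training scripts (the file passed
# via --init_model) and copy them into the encoder before fine-tuning.
state_dict = paddle.load("compound_encoder.pdparams")
encoder.set_state_dict(state_dict)
```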
### GNN Models
We provide the GIN, GCN, and GAT models. Here we introduce the details of these GNN models.
#### GIN
Graph Isomorphism Network (GIN) recursively aggregates node features over the edge structure of the graph, so that isomorphic graphs produce the same graph-level features while non-isomorphic graphs produce different ones. To use GIN, you need to set the following hyperparameters:
- hidden_size: The hidden size of GIN.
- embed_dim: The embedding dimension of GIN.
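The core GIN update can be written as `h'_v = MLP((1 + eps) * h_v + sum_{u in N(v)} h_u)`. Below is a minimal NumPy sketch of a single layer, for illustration only (the actual implementation lives in the PaddleHelix GNN models):
```python
import numpy as np

def gin_layer(h, adj, mlp, eps=0.0):
    """One GIN update: combine each node's features with the sum of its
    neighbors' features, then apply an MLP.
    h: (num_nodes, dim) node features; adj: (num_nodes, num_nodes) 0/1."""
    aggregated = (1.0 + eps) * h + adj @ h   # self term + neighbor sum
    return mlp(aggregated)

# Tiny usage example on a 3-node path graph, with an identity "MLP"
h = np.eye(3)
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
print(gin_layer(h, adj, mlp=lambda x: x))
```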