# PretrainGNNs
[中文版本](./README_cn.md) [English Version](./README.md)
* [Background](#background)
* [Instructions](#instructions)
* [How to get?](#how-to-get)
* [Model link](#model-link)
* [Data link](#data-link)
* [Training Models](#training-models)
* [Finetuning Models](#finetuning-models)
* [GNN Models](#gnn-models)
* [GIN](#gin)
* [GAT](#gat)
* [GCN](#gcn)
* [Other Parameters](#other-parameters)
* [Compound Related Tasks](#compound-related-tasks)
* [Pretraining Tasks](#pretraining-tasks)
* [Pre-training datasets](#pre-training-datasets)
* [Node-level](#node-level)
* [Graph-level](#graph-level)
* [Downstream Tasks](#downstream-tasks)
* [Chemical molecular properties prediction](#chemical-molecular-properties-prediction)
* [Downstream classification datasets](#downstream-classification-datasets)
* [Fine-tuning](#fine-tuning)
* [Evaluation results](#evaluation-results)
* [Data](#data)
* [How to get?](#how-to-get-1)
* [Datasets introduction](#datasets-introduction)
* [Reference](#reference)
* [Paper-related](#paper-related)
* [Data-related](#data-related)
## Background
In recent years, deep learning has achieved strong results in many fields, but its application to molecular informatics and drug development is still limited. Drug development is an expensive and time-consuming process, and the compound screening stage in particular needs efficiency improvements. Early work used traditional machine learning methods to predict physical and chemical properties. Molecules are naturally represented as graphs: graphs have irregular shapes and sizes, their nodes have no spatial order, and a node's neighborhood is determined by structure rather than position. Treating molecular structures as graphs has therefore drawn growing attention to graph neural networks. In practice, however, model performance is limited by scarce labels and by distribution shift between the training and test sets.
To mitigate these issues, we implement the models described in the paper "Strategies for Pre-training Graph Neural Networks". The model is first pre-trained on data-rich related tasks, at both the node level and the graph level, and then fine-tuned on downstream tasks. As for the implementation, we provide GIN, GAT, GCN, and other models.
## Instructions
### How to get?
#### Model link
You can download the [pretrained models](https://baidu-nlp.bj.bcebos.com/PaddleHelix/pretrained_models/compound/pregnn-attrmask-supervised.zip) or train them yourself.
#### Data link
You can download the dataset from the [link](http://snap.stanford.edu/gnn-pretrain/data/chem_dataset.zip) we provide and perform the corresponding preprocessing before use. We recommend unzipping the dataset into the `data` folder under the root directory; if that folder does not exist, create it first.
```bash
# cd to PaddleHelix folder
mkdir -p data
cd data
wget http://snap.stanford.edu/gnn-pretrain/data/chem_dataset.zip
unzip chem_dataset.zip
```
### Training Models
The pre-training strategies we provide fall into two levels. The first is node-level pre-training, for which there are two methods; the second is a supervised pre-training strategy for the whole graph. You can combine them in your experiments: perform node-level pre-training first, then graph-level pre-training on top of it, as follows:
![diagram](./imgs/pregnn.png)
The corresponding training files are:
```
pretrain_attrmask.py # Node-level attribute masking pre-training file
pretrain_supervised.py # Pre-training files at the entire graph level
```
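To make the node-level strategy concrete, attribute masking corrupts a fraction of the node features and trains the model to recover them. Below is a minimal NumPy sketch of the idea; all names are illustrative and this is not the code in `pretrain_attrmask.py`:
```python
import numpy as np

def mask_node_attributes(node_feats, mask_rate=0.15, mask_token=0):
    """Randomly mask a fraction of the nodes' attributes. The GNN encodes
    the corrupted graph, and a small head is trained to predict the
    original attributes of the masked nodes."""
    num_nodes = node_feats.shape[0]
    num_mask = max(1, int(num_nodes * mask_rate))
    masked_idx = np.random.choice(num_nodes, num_mask, replace=False)

    labels = node_feats[masked_idx].copy()   # prediction targets
    corrupted = node_feats.copy()
    corrupted[masked_idx] = mask_token       # overwrite with the mask token
    return corrupted, masked_idx, labels
```
The GNN then encodes the corrupted graph, and a prediction head is trained with a cross-entropy loss to recover `labels` at the positions in `masked_idx`.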
Using `pretrain_attrmask.py` as an example, the model parameters are:
- `batch_size` : Batch size of the model; the default value for training is 256.
- `max_epoch` : Maximum number of epochs to train the model, which you can choose according to your compute power (training on a single Tesla V100 takes about 11 minutes per epoch).
- `data_path` : The path the data is loaded from. First download the dataset from the link we provide; we recommend unzipping it into the `data` folder under the root directory (create the folder if it does not exist).
- `init_model` : The model used for initialization, i.e., a model without the pre-training strategy applied; our pre-trained model is available at this [address](https://baidu-nlp.bj.bcebos.com/PaddleHelix/pretrained_models/compound/pregnn-attrmask-supervised.zip).
- `compound_encoder_config` : The path of the compound encoder config file, containing the parameters of the GNN model.
- `model_config` : The path of the model config file, containing the parameters of the pre-training model.
- `dropout_rate` : The dropout rate; here we use 0.2.
- `model_dir` : The path to save the model.
```bash
CUDA_VISIBLE_DEVICES=0 python pretrain_attrmask.py \
--batch_size=256 \
--num_workers=2 \
--max_epoch=100 \
--lr=1e-3 \
--dropout_rate=0.2 \
--data_path=../../../data/chem_dataset/zinc_standard_agent \
--compound_encoder_config=model_configs/pregnn_paper.json \
--model_config=model_configs/pre_Attrmask.json \
--model_dir=../../../output/pretrain_gnns/pretrain_attrmask
```
We provide shell scripts that run the Python files directly; you can adjust the parameters in the scripts.
```bash
sh scripts/pretrain_attrmask.sh #run pretrain_attrmask.py with given parameters
sh scripts/pretrain_supervised.sh #run pretrain_supervised.py with given parameters
```
### Finetuning Models
Fine-tuning the model is similar to training it, and the parameters are defined in the same way. The init model is the [pre-trained model](https://baidu-nlp.bj.bcebos.com/PaddleHelix/pretrained_models/compound/pregnn-attrmask-supervised.zip) we downloaded before; you can put it in the corresponding folder.
```bash
CUDA_VISIBLE_DEVICES=0 python finetune.py \
--batch_size=32 \
--max_epoch=4 \
--dataset_name=tox21 \
--data_path=../../../data/chem_dataset/tox21 \
--compound_encoder_config=model_configs/pregnn_paper.json \
--model_config=model_configs/down_linear.json \
--init_model=../../../output/pretrain_gnns/pregnn_paper-pre_Attrmask-pre_Supervised/epoch40/compound_encoder.pdparams \
--model_dir=../../../output/pretrain_gnns/finetune/tox21 \
--encoder_lr=1e-3 \
--head_lr=1e-3 \
--dropout_rate=0.2
```
We provide a shell script to run the fine-tuning file; you can adjust the parameters in the script.
```bash
sh scripts/finetune.sh #run finetune.py with given parameters
```
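For reference, restoring the pre-trained encoder during fine-tuning follows the standard PaddlePaddle load pattern. Below is a minimal sketch, assuming `paddle` is installed; the `Linear` layer is only a stand-in for the real compound encoder, which `finetune.py` builds from `compound_encoder_config`:
```python
import paddle

# Stand-in for the compound encoder; finetune.py builds the real GNN
# encoder from the --compound_encoder_config file.
encoder = paddle.nn.Linear(300, 300)

# Load the parameters saved by the pre-training scripts (the file passed
# via --init_model) and copy them into the encoder before fine-tuning.
state_dict = paddle.load("compound_encoder.pdparams")
encoder.set_state_dict(state_dict)
```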
### GNN Models
We provide the GIN, GCN, and GAT models. Here we introduce the details of these GNN models.
#### GIN
Graph Isomorphism Network (GIN) recursively aggregates node features over the edge structure of the graph, so that isomorphic graphs produce the same graph-level features while non-isomorphic graphs produce different ones. To use GIN, you need to set the following hyperparameters:
- hidden_size: The hidden size of GIN.
- embed_dim: The embedding dimension of GIN.
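The core GIN update can be written as `h'_v = MLP((1 + eps) * h_v + sum_{u in N(v)} h_u)`. Below is a minimal NumPy sketch of a single layer, for illustration only (the actual implementation lives in the PaddleHelix GNN models):
```python
import numpy as np

def gin_layer(h, adj, mlp, eps=0.0):
    """One GIN update: combine each node's features with the sum of its
    neighbors' features, then apply an MLP.
    h: (num_nodes, dim) node features; adj: (num_nodes, num_nodes) 0/1."""
    aggregated = (1.0 + eps) * h + adj @ h   # self term + neighbor sum
    return mlp(aggregated)

# Tiny usage example on a 3-node path graph, with an identity "MLP"
h = np.eye(3)
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
print(gin_layer(h, adj, mlp=lambda x: x))
```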