A Deep Learning Toolkit for DTI, Drug Property, PPI, DDI, Protein Function Prediction (Bioinformatics)
kexinhuang12345/DeepPurpose
Oct 26, 2021
A Deep Learning Library for Compound and Protein Modeling
DTI, Drug Property, PPI, DDI, Protein Function Prediction
Applications in Drug Repurposing, Virtual Screening, QSAR, Side Effect Prediction and More
This repository hosts DeepPurpose, a deep learning based molecular modeling and prediction toolkit for drug-target interaction (DTI) prediction, compound property prediction, protein-protein interaction (PPI) prediction, and protein function prediction, built on PyTorch. We focus on DTI and its applications in drug repurposing and virtual screening, but support various other molecular encoding tasks. The library is designed for easy use (only a few lines of code) to facilitate deep learning for life science research.
News!
[05/21] 0.1.2: Support for 5 new graph neural network based models for compound encoding (DGL_GCN, DGL_NeuralFP, DGL_GIN_AttrMasking, DGL_GIN_ContextPred, DGL_AttentiveFP), implemented using DGL Life Science! An example is provided here!
[12/20] DeepPurpose is now supported by the TDC data loader, which contains a large collection of ML-for-therapeutics datasets, including many drug property and DTI datasets. Here is a tutorial!
[12/20] DeepPurpose can now be installed via pip!
[11/20] DeepPurpose is published in Bioinformatics!
[11/20] Added 5 more pretrained models on BindingDB IC50 units (around 1 million data points).
[10/20] Google Colab installation instructions are provided here. Thanks to @hima111997!
[10/20] Using DeepPurpose, we made a humans-in-the-loop molecular design web UI, check it out! [Website, paper]
[09/20] DeepPurpose now supports three more tasks: DDI, PPI and protein function prediction! Simply use from DeepPurpose import DDI (or PPI, ProteinPred); check out the examples below!
[07/20] A simple web UI for DTI prediction can be created in under 10 lines using Gradio! A demo is provided here.
[07/20] A blog post is up on the Towards Data Science Medium column, check it out!
[07/20] Two tutorials are online that walk through DeepPurpose's framework for drug-target interaction prediction and drug property prediction (DTI, Drug Property).
[05/20] Support for drug property prediction on screening data that does not have target proteins, such as bacteria! An example using RDKit2D with DNN for training and repurposing for Pseudomonas aeruginosa (MIT AI Cures's open task) is provided as a demo.
[05/20] Now supports hyperparameter tuning via Bayesian optimization through the Ax platform! A demo is provided here.
Features
15+ powerful encodings for drugs and proteins, ranging from deep neural networks on classic cheminformatics fingerprints, CNNs and transformers to message passing graph neural networks, with 50+ combined models! Most of these encoding combinations have not yet appeared in existing work. All of this in under 10 lines, but with lots of flexibility! Switching encodings is as simple as changing the encoding names!
Realistic and user-friendly design:
supports DTI, DDI, PPI, molecular property prediction and protein function prediction!
automatically identifies whether the task is drug-target binding affinity prediction (regression) or drug-target interaction prediction (binary classification).
supports cold-target and cold-drug settings for robust model evaluation, and supports the single-target high throughput sequencing assay data setup.
many dataset loading/downloading/unzipping scripts to ease tedious preprocessing, including antiviral drugs, COVID-19 targets, BindingDB, DAVIS, KIBA, ...
many pretrained checkpoints.
easy monitoring of the training process, with detailed training metrics output such as test-set figures (AUCs) and tables; early stopping is also supported.
detailed output records, such as a ranked list for repurposing results.
various evaluation metrics: ROC-AUC, PR-AUC and F1 for binary tasks; MSE, R-squared and Concordance Index for regression tasks.
label unit conversion for skewed label distributions such as Kd.
time reference for computationally expensive encodings.
PyTorch based; supports CPU, GPU and multi-GPU.
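Of the regression metrics listed above, the Concordance Index is probably the least familiar, so here is a small library-independent sketch of how it can be computed. This is not DeepPurpose's own implementation; the function name concordance_index and the sample values are assumptions made for illustration only.

```python
from itertools import combinations


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with identical true values are skipped as not comparable;
    pairs with tied predictions count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # no defined ordering for this pair
        comparable += 1
        if p_i == p_j:
            concordant += 0.5
        elif (p_i > p_j) == (t_i > t_j):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("need at least one pair with distinct true values")
    return concordant / comparable


# Hypothetical affinities (not taken from any dataset).
y_true = [5.0, 6.2, 7.1, 8.3]
y_pred = [5.4, 6.0, 7.9, 7.5]
print(concordance_index(y_true, y_pred))
```

A value of 1.0 means every comparable pair is ranked correctly, while 0.5 is what a constant predictor would get. The pairwise loop is quadratic in the number of samples, so it is only meant for small illustrations.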
NOTE: We are actively looking for constructive advice, user feedback and experiences with using DeepPurpose! Please open an issue or contact us.
Cite Us
If you find this package useful, please cite our paper:

    @article{huang2020deeppurpose,
      title={DeepPurpose: A Deep Learning Library for Drug-Target Interaction Prediction},
      author={Huang, Kexin and Fu, Tianfan and Glass, Lucas M and Zitnik, Marinka and Xiao, Cao and Sun, Jimeng},
      journal={Bioinformatics},
      year={2020}
    }
Installation
Try it on Binder! Binder is a cloud Jupyter Notebook interface that will install our environment dependencies for you.
Video tutorial to install Binder.
We recommend installing locally, since Binder needs to be refreshed every time it is launched. To install locally, we recommend installing from pip:
pip
    conda create -n DeepPurpose python=3.6
    conda activate DeepPurpose
    conda install -c conda-forge rdkit
    conda install -c conda-forge notebook
    pip install git+https://github.com/bp-kelley/descriptastorus
    pip install DeepPurpose
Build from Source
First time:
    git clone https://github.com/kexinhuang12345/DeepPurpose.git ## Download code repository
    cd DeepPurpose ## Change directory to DeepPurpose
    conda env create -f environment.yml ## Build virtual environment with all packages installed using conda
    conda activate DeepPurpose ## Activate conda environment (use "source activate DeepPurpose" for anaconda 4.4 or earlier)
    jupyter notebook ## Open the jupyter notebook with the conda env

    ## Run our code, e.g. click a file in the DEMO folder
    ... ...

    conda deactivate ## When done, exit conda environment
In the future:
    cd DeepPurpose ## Change directory to DeepPurpose
    conda activate DeepPurpose ## Activate conda environment
    jupyter notebook ## Open the jupyter notebook with the conda env

    ## Run our code, e.g. click a file in the DEMO folder
    ... ...

    conda deactivate ## When done, exit conda environment
Video tutorial to install locally from source.
Example
Case Study 1(a): A Framework for Drug-Target Interaction Prediction, in fewer than 10 lines of code.
In addition to DTI prediction, we also provide repurposing and virtual screening functions to rapidly generate predictions.
Click here for the code!
    from DeepPurpose import DTI as models
    from DeepPurpose.utils import *
    from DeepPurpose.dataset import *

    # Load Data, an array of SMILES for drug, an array of Amino Acid Sequence for Target and an array of binding values/0-1 label.
    # e.g. ['Cc1ccc(CNS(=O)(=O)c2ccc(s2)S(N)(=O)=O)cc1', ...], ['MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTH...', ...], [0.46, 0.49, ...]
    # In this example, BindingDB with Kd binding score is used.
    X_drug, X_target, y = process_BindingDB(download_BindingDB(SAVE_PATH),
                                            y = 'Kd',
                                            binary = False,
                                            convert_to_log = True)

    # Type in the encoding names for drug/protein.
    drug_encoding, target_encoding = 'CNN', 'Transformer'

    # Data processing, here we select cold protein split setup.
    train, val, test = data_process(X_drug, X_target, y,
                                    drug_encoding, target_encoding,
                                    split_method='cold_protein',
                                    frac=[0.7,0.1,0.2])

    # Generate new model using default parameters; also allow model tuning via input parameters.
    config = generate_config(drug_encoding, target_encoding, transformer_n_layer_target = 8)
    net = models.model_initialize(**config)

    # Train the new model.
    # Detailed output, including a tidy table storing validation loss, metrics, AUC curve figures, etc., is stored in the ./result folder.
    net.train(train, val, test)

    # or simply load pretrained model from a model directory path or reproduced model name such as DeepDTA
    net = models.model_pretrained(MODEL_PATH_DIR or MODEL_NAME)

    # Repurpose using the trained model or pre-trained model
    # In this example, loading repurposing dataset using Broad Repurposing Hub and SARS-CoV 3CL Protease Target.
    X_repurpose, drug_name, drug_cid = load_broad_repurposing_hub(SAVE_PATH)
    target, target_name = load_SARS_CoV_Protease_3CL()

    _ = models.repurpose(X_repurpose, target, net, drug_name, target_name)

    # Virtual screening using the trained model or pre-trained model
    X_repurpose, drug_name, target, target_name = \
        ['CCCCCCCOc1cccc(c1)C([O-])=O', ...], ['16007391', ...], ['MLARRKPVLPALTINPTIAEGPSPTSEGASEANLVDLQKKLEEL...', ...], ['P36896', 'P00374']

    _ = models.virtual_screening(X_repurpose, target, net, drug_name, target_name)
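The 'cold_protein' split selected in the example above keeps every protein in exactly one of the train, validation and test folds, so the model is evaluated on targets it has never seen. The idea can be sketched in plain Python as follows. This is only an illustration, not the code DeepPurpose runs internally; the helper name cold_protein_split and the toy protein IDs are assumptions.

```python
import random


def cold_protein_split(targets, frac=(0.7, 0.1, 0.2), seed=1):
    """Return train/val/test row indices so that no protein appears in two folds.

    The split is made over unique proteins, so the row counts per fold only
    approximately follow `frac`.
    """
    unique_targets = sorted(set(targets))
    rng = random.Random(seed)
    rng.shuffle(unique_targets)

    n_train = int(len(unique_targets) * frac[0])
    n_val = int(len(unique_targets) * frac[1])
    train_set = set(unique_targets[:n_train])
    val_set = set(unique_targets[n_train:n_train + n_val])

    train, val, test = [], [], []
    for idx, target in enumerate(targets):
        if target in train_set:
            train.append(idx)
        elif target in val_set:
            val.append(idx)
        else:
            test.append(idx)  # every remaining protein goes to the test fold
    return train, val, test


# Toy data: 50 drug-target rows over 10 hypothetical protein IDs.
targets = ["P%d" % (i % 10) for i in range(50)]
train_idx, val_idx, test_idx = cold_protein_split(targets)
print(len(train_idx), len(val_idx), len(test_idx))
```

A cold-drug split follows the same pattern with the drug column in place of the protein column.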
Case Study 1(b): A Framework for Drug Property Prediction, in fewer than 10 lines of code.
Many datasets come in the form of high throughput screening data, which contain only drugs and their activity scores. This can be formulated as a drug property prediction task. We also provide a repurposing function to predict over a large space of drugs.
Click here for the code!
    from DeepPurpose import CompoundPred as models
    from DeepPurpose.utils import *
    from DeepPurpose.dataset import *

    # load AID1706 Assay Data
    X_drugs, _, y = load_AID1706_SARS_CoV_3CL()

    drug_encoding = 'rdkit_2d_normalized'
    train, val, test = data_process(X_drug = X_drugs, y = y,
                                    drug_encoding = drug_encoding,
                                    split_method='random',
                                    random_seed = 1)

    config = generate_config(drug_encoding = drug_encoding,
                             cls_hidden_dims = [512],
                             train_epoch = 20,
                             LR = 0.001,
                             batch_size = 128,
                            )

    model = models.model_initialize(**config)
    model.train(train, val, test)

    X_repurpose, drug_name, drug_cid = load_broad_repurposing_hub(SAVE_PATH)

    _ = models.repurpose(X_repurpose, model, drug_name)
Case Study 1(c): A Framework for Drug-Drug Interaction Prediction, in fewer than 10 lines of code.
DDI is very important for drug safety profiling and the success of clinical trials. This framework predicts interactions based on the chemical structures of drug pairs.
Click here for the code!
    from DeepPurpose import DDI as models
    from DeepPurpose.utils import *
    from DeepPurpose.dataset import *

    # load DB Binary Data
    X_drugs, X_drugs_, y = read_file_training_dataset_drug_drug_pairs()

    drug_encoding = 'rdkit_2d_normalized'
    train, val, test = data_process(X_drug = X_drugs, X_drug_ = X_drugs_, y = y,
                                    drug_encoding = drug_encoding,
                                    split_method='random',
                                    random_seed = 1)

    config = generate_config(drug_encoding = drug_encoding,
                             cls_hidden_dims = [512],
                             train_epoch = 20,
                             LR = 0.001,
                             batch_size = 128,
                            )

    model = models.model_initialize(**config)
    model.train(train, val, test)
Case Study 1(d): A Framework for Protein-Protein Interaction Prediction, in fewer than 10 lines of code.
PPI is important to study the relations among targets.
Click here for the code!
    from DeepPurpose import PPI as models
    from DeepPurpose.utils import *
    from DeepPurpose.dataset import *

    # load DB Binary Data
    X_targets, X_targets_, y = read_file_training_dataset_protein_proteins_pairs()

    target_encoding = 'CNN'
    train, val, test = data_process(X_target = X_targets, X_target_ = X_targets_, y = y,
                                    target_encoding = target_encoding,
                                    split_method='random',
                                    random_seed = 1)

    config = generate_config(target_encoding = target_encoding,
                             cls_hidden_dims = [512],
                             train_epoch = 20,
                             LR = 0.001,
                             batch_size = 128,
                            )

    model = models.model_initialize(**config)
    model.train(train, val, test)
Case Study 1(e): A Framework for Protein Function Prediction, in fewer than 10 lines of code.
Protein function prediction helps predict various useful properties such as GO terms and structural classification. It is also useful for screening biologic drugs.
Click here for the code!
    from DeepPurpose import ProteinPred as models
    from DeepPurpose.utils import *
    from DeepPurpose.dataset import *

    # load DB Binary Data
    X_targets, y = read_file_protein_function()

    target_encoding = 'CNN'
    train, val, test = data_process(X_target = X_targets, y = y,
                                    target_encoding = target_encoding,
                                    split_method='random',
                                    random_seed = 1)

    config = generate_config(target_encoding = target_encoding,
                             cls_hidden_dims = [512],
                             train_epoch = 20,
                             LR = 0.001,
                             batch_size = 128,
                            )

    model = models.model_initialize(**config)
    model.train(train, val, test)
Case Study 2 (a): Antiviral Drugs Repurposing for SARS-CoV2 3CLPro, using One Line.
Given a new target sequence (e.g. SARS-CoV2 3CL Protease), retrieve a list of repurposing drugs from a curated drug library of 81 antiviral drugs. The Binding Score is the predicted Kd value. Results are aggregated from five pretrained models on the BindingDB dataset! (Caution: this is currently for educational purposes. The pretrained DTI models only cover a small dataset and thus cannot generalize to every new unseen protein. For best results, train your own model with customized data.)
Click here for the code!
    from DeepPurpose import oneliner
    from DeepPurpose.dataset import *
    oneliner.repurpose(*load_SARS_CoV2_Protease_3CL(), *load_antiviral_drugs(no_cid = True))

    ----output----
    Drug Repurposing Result for SARS-CoV2 3CL Protease
    +------+----------------------+------------------------+---------------+
    | Rank | Drug Name            | Target Name            | Binding Score |
    +------+----------------------+------------------------+---------------+
    | 1    | Sofosbuvir           | SARS-CoV2 3CL Protease | 190.25        |
    | 2    | Daclatasvir          | SARS-CoV2 3CL Protease | 214.58        |
    | 3    | Vicriviroc           | SARS-CoV2 3CL Protease | 315.70        |
    | 4    | Simeprevir           | SARS-CoV2 3CL Protease | 396.53        |
    | 5    | Etravirine           | SARS-CoV2 3CL Protease | 409.34        |
    | 6    | Amantadine           | SARS-CoV2 3CL Protease | 419.76        |
    | 7    | Letermovir           | SARS-CoV2 3CL Protease | 460.28        |
    | 8    | Rilpivirine          | SARS-CoV2 3CL Protease | 470.79        |
    | 9    | Darunavir            | SARS-CoV2 3CL Protease | 472.24        |
    | 10   | Lopinavir            | SARS-CoV2 3CL Protease | 473.01        |
    | 11   | Maraviroc            | SARS-CoV2 3CL Protease | 474.86        |
    | 12   | Fosamprenavir        | SARS-CoV2 3CL Protease | 487.45        |
    | 13   | Ritonavir            | SARS-CoV2 3CL Protease | 492.19        |
    ....
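The ranked table above comes from combining the predictions of several pretrained models. As a rough illustration of that idea, the sketch below averages each drug's predicted Kd across models and sorts ascending, since a lower Kd means tighter binding. It assumes a plain mean; the aggregation DeepPurpose actually applies may differ. The function name rank_by_mean_score, the drug names and all numbers are made up for the example.

```python
from statistics import mean


def rank_by_mean_score(drug_names, scores_per_model):
    """Average each drug's predicted Kd across models, then sort ascending.

    `scores_per_model` holds one list of scores per model, each aligned
    with `drug_names`.
    """
    for scores in scores_per_model:
        if len(scores) != len(drug_names):
            raise ValueError("every model must score every drug")
    # zip(*...) regroups the scores so that each tuple belongs to one drug.
    averaged = [mean(per_drug) for per_drug in zip(*scores_per_model)]
    return sorted(zip(drug_names, averaged), key=lambda pair: pair[1])


# Hypothetical predictions from two models for three placeholder drugs.
drug_names = ["drug_a", "drug_b", "drug_c"]
scores_per_model = [
    [100.0, 300.0, 50.0],
    [200.0, 100.0, 70.0],
]
ranking = rank_by_mean_score(drug_names, scores_per_model)
for rank, (name, score) in enumerate(ranking, start=1):
    print(rank, name, score)
```

Averaging smooths out the variance of individual models, at the cost of hiding cases where one model strongly disagrees with the rest.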
Case Study 2(b): Repurposing using Customized training data, with One Line.
Given a new target sequence (e.g. SARS-CoV 3CL Pro), train on new data (AID1706 Bioassay), and then retrieve a list of repurposing drugs from a proprietary library (e.g. antiviral drugs). The model can be trained from scratch or fine-tuned from the pretrained checkpoint!
Click here for the code!
    from DeepPurpose import oneliner
    from DeepPurpose.dataset import *

    oneliner.repurpose(*load_SARS_CoV_Protease_3CL(), *load_antiviral_drugs(no_cid = True), *load_AID1706_SARS_CoV_3CL(), \
                       split='HTS', convert_y = False, frac=[0.8,0.1,0.1], pretrained = False, agg = 'max_effect')

    ----output----
    Drug Repurposing Result for SARS-CoV 3CL Protease
    +------+----------------------+-----------------------+-------------+-------------+
    | Rank | Drug Name            | Target Name           | Interaction | Probability |
    +------+----------------------+-----------------------+-------------+-------------+
    | 1    | Remdesivir           | SARS-CoV 3CL Protease | YES         | 0.99        |
    | 2    | Efavirenz            | SARS-CoV 3CL Protease | YES         | 0.98        |
    | 3    | Vicriviroc           | SARS-CoV 3CL Protease | YES         | 0.98        |
    | 4    | Tipranavir           | SARS-CoV 3CL Protease | YES         | 0.96        |
    | 5    | Methisazone          | SARS-CoV 3CL Protease | YES         | 0.94        |
    | 6    | Letermovir           | SARS-CoV 3CL Protease | YES         | 0.88        |
    | 7    | Idoxuridine          | SARS-CoV 3CL Protease | YES         | 0.77        |
    | 8    | Loviride             | SARS-CoV 3CL Protease | YES         | 0.76        |
    | 9    | Baloxavir            | SARS-CoV 3CL Protease | YES         | 0.74        |
    | 10   | Ibacitabine          | SARS-CoV 3CL Protease | YES         | 0.70        |
    | 11   | Taribavirin          | SARS-CoV 3CL Protease | YES         | 0.65        |
    | 12   | Indinavir            | SARS-CoV 3CL Protease | YES         | 0.62        |
    | 13   | Podophyllotoxin      | SARS-CoV 3CL Protease | YES         | 0.60        |
    ....
Demos
Check out 10+ demos & tutorials to start:
Check out the Dataset Tutorial.
Pretrained models
We provide more than 10 pretrained models. Please see Pretraining Model Tutorial on how to load them. It is as simple as
    from DeepPurpose import DTI as models
    net = models.model_pretrained(model = 'MPNN_CNN_DAVIS')
    # or
    net = models.model_pretrained(FILE_PATH)

The list of available pretrained models:
A model name consists of the drug encoding, then the target encoding, and then the training dataset.
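The naming convention just described can be unpacked with a short helper. This is a sketch that assumes the three parts are joined by underscores and that neither encoding name contains an underscore itself, as in 'MPNN_CNN_DAVIS'; parse_model_name is a hypothetical helper and not part of DeepPurpose.

```python
def parse_model_name(name):
    """Split a pretrained model name into drug encoding, target encoding and dataset."""
    parts = name.split("_", 2)  # at most three pieces; the dataset keeps any later underscores
    if len(parts) != 3:
        raise ValueError("expected <drug encoding>_<target encoding>_<dataset>, got %r" % name)
    drug_encoding, target_encoding, dataset = parts
    return {
        "drug_encoding": drug_encoding,
        "target_encoding": target_encoding,
        "dataset": dataset,
    }


print(parse_model_name("MPNN_CNN_DAVIS"))
```

Limiting the split to two separators means a dataset label with extra underscores stays in one piece.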
Note that for DTI models, the BindingDB and DAVIS models are trained on log scale. DeepPurpose lets you convert between log scale (e.g. pIC50) and the original scale via the convert_y variable.
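For readers unfamiliar with the log scale mentioned here, the sketch below shows the standard conversion between a concentration in nM and its negative log10 molar value (for example IC50 to pIC50). It assumes labels are given in nM. The helper names nm_to_p and p_to_nm are made up for this example, and DeepPurpose's own conversion may differ in details such as numerical safeguards.

```python
import math


def nm_to_p(value_nm):
    """Convert a concentration in nM to -log10 of its molar value (e.g. IC50 -> pIC50)."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


def p_to_nm(p_value):
    """Inverse of nm_to_p: convert a p-scale value back to nM."""
    return 10.0 ** (9.0 - p_value)


print(nm_to_p(1000.0))  # 1000 nM is 1 micromolar
print(p_to_nm(6.0))
```

Raw affinities often span several orders of magnitude, and the log transform compresses that range so a regression loss is not dominated by a few very large values.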
Click here for the models supported!