{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Tutorial 104: Generate 21 ADME Predictors with 10 Lines of Code\n", "\n", "[Kexin](https://twitter.com/KexinHuang5)\n", "\n", "The previous tutorials should have familiarized you with TDC. In this tutorial, we show through examples how to use TDC for fast ML model prototyping with DeepPurpose. Let's start by introducing DeepPurpose.\n", "\n", "### DeepPurpose Overview\n", "DeepPurpose is a scikit-learn-style deep learning toolkit for molecular modeling and prediction. It covers drug-target interaction prediction, compound property prediction, protein-protein interaction prediction, and protein function prediction. Using DeepPurpose, we can rapidly build model prototypes for various drug discovery tasks covered in TDC, such as ADME, Tox, HTS, developability, DTI, DDI, PPI, and antibody affinity prediction. \n", "\n", "Note that DeepPurpose is developed by two members of the TDC core team, Kexin and Tianfan, and it is published in Bioinformatics. To start this tutorial, please follow the [DeepPurpose instructions](https://github.com/kexinhuang12345/DeepPurpose#install--usage) to set up the necessary packages. DeepPurpose also provides [tutorials](https://github.com/kexinhuang12345/DeepPurpose/blob/master/Tutorial_1_DTI_Prediction.ipynb) to help you get familiar with it. \n", "\n", "To install DeepPurpose, run the following in your terminal:\n", "\n", "```bash\n", "conda create -n DeepPurpose python=3.6\n", "conda activate DeepPurpose\n", "conda install -c conda-forge rdkit\n", "pip install git+https://github.com/bp-kelley/descriptastorus \n", "pip install DeepPurpose\n", "pip install PyTDC --upgrade\n", "```\n", "\n", "Then open this Jupyter notebook using this conda environment.\n", "\n", "We assume you have now set up the right environment. Next, we show you how to build an ADME predictor using a Message Passing Neural Network (MPNN)! 
\n", "\n", "### Predicting HIA using MPNN with 10 Lines of Code\n", "\n", "First, let's load DeepPurpose and TDC:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from DeepPurpose import utils, CompoundPred\n", "from tdc.single_pred import ADME" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, you can get the HIA dataset from TDC. HIA belongs to the ADME task under single-instance prediction. We want to predict whether a compound can be absorbed in the human intestine, i.e. given a SMILES string X, predict 1/0. Note that for drug property prediction, DeepPurpose takes an array of drug SMILES strings and an array of labels. You can access these directly by calling `get_data(format = 'dict')`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading...\n", "100%|██████████| 48.1k/48.1k [00:00<00:00, 588kiB/s]\n", "Loading...\n", "Done!\n" ] } ], "source": [ "data = ADME(name = 'HIA_Hou').get_data(format = 'dict')\n", "X, y = data['Drug'], data['Y']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, for simplicity, we also provide a DeepPurpose format, which directly returns the correct input data:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Found local copy...\n", "Loading...\n", "Done!\n" ] } ], "source": [ "X, y = ADME(name = 'HIA_Hou').get_data(format = 'DeepPurpose')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DeepPurpose provides 8 encoders for compounds, ranging from MLPs on classic cheminformatics fingerprints such as Morgan and RDKit2D to deep learning models such as CNN, transformer, and MPNN. To specify the encoder, simply type the encoder name. 
Here, we use MPNN as an example:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "drug_encoding = 'MPNN'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we encode the data into the specified format using the `utils.data_process` function. It lets you specify the train/validation/test split fractions and a random seed, so that the same data splits can be reproduced. **We have made DeepPurpose accommodate the TDC benchmark split.** Simply passing 'TDC' as the random seed generates the same split as the TDC split function. The function outputs train, val, and test pandas dataframes." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Drug Property Prediction Mode...\n", "in total: 578 drugs\n", "encoding drug...\n", "unique drugs: 578\n", "Done.\n" ] } ], "source": [ "train, val, test = utils.data_process(X_drug = X, \n", " y = y, \n", " drug_encoding = drug_encoding,\n", " random_seed = 'TDC')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | SMILES | Label | drug_encoding |
|---|---|---|---|
| 0 | CC(=O)N/C1=C/C=CC=C1 | 1 | [[[tensor(1.), tensor(0.), tensor(0.), tensor(... |
| 1 | CC(=O)N/C1=N/N=C([S]1)[S](N)(=O)=O | 1 | [[[tensor(1.), tensor(0.), tensor(0.), tensor(... |