{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TDC 101: Data Loaders\n", "\n", "[Kexin](https://twitter.com/KexinHuang5)\n", "\n", "Welcome to the TDC community! In this first tutorial, we will cover the basics of TDC and after this tutorial, you will be able to access all of the TDC datasets!\n", "\n", "### What is TDC?\n", "\n", "Therapeutics Data Commons (TDC) is an open and extensive data hub that includes 50+ machine learning (ML) ready datasets across 20+ therapeutic tasks, ranging from target discovery, activity screening, efficacy, safety to manufacturing, covering small molecule, antibodies, miRNA and other therapeutics products.\n", "\n", "TDC mainly consists of two parts: the first core part is the datasets loaders where you can get the ML ready data; and the second part is the data functions where we provide numerous functions to help process the data, evaluate the ML models, etc. Let's get started!\n", "\n", "### Installation\n", "\n", "The core TDC library only relies on a few lightweight packages: pandas, scikit-learn, numpy, fuzzywuzzy, tqdm. For some specific use case requiring other packages, we provide instructions when you use it. To install TDC, simply type:\n", "\n", "```bash\n", "pip install PyTDC\n", "```\n", "\n", "Now, you have TDC installed! We can get started to introduce the dataset loader! \n", "\n", "TDC covers a wide range of therapeutics tasks with varying data structures. Thus, we organize it into three layers of hierarchies. First, we divide into three distinctive machine learning **problems**:\n", "\n", "* Single-instance prediction `single_pred`: Prediction of property given individual biomedical entity.\n", "* Multi-instance prediction `multi_pred`: Prediction of property given multiple biomedical entities. \n", "* Generation `generation`: Generation of new biomedical entity.\n", "\n", "The second layer is **task**. Each therapeutic task falls into one of the machine learning problem. We create a data loader class for every *task* that inherits from the base problem data loader. \n", "\n", "The last layer is **dataset**, where each task consists of many of them. As the data structure of most datasets in a task is the same, the dataset is used as a function input to the task data loader.\n", "\n", "Supposed a dataset X is from therapeutic task Y with machine learning problem Z, then to obtain the data and splits, simply type:\n", "\n", "```python\n", "from tdc.Z import Y\n", "data = Y(name = 'X')\n", "split = data.split()\n", "```\n", "For example, to obtain the Caco2 dataset from ADME therapeutic task in the single-instance prediction problem:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Found local copy...\n", "Loading...\n", "Done!\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Drug_IDDrugY
0(-)-epicatechinO1c2c(CC(O)C1c1cc(O)c(O)cc1)c(O)cc(O)c2-6.22
1(2E,4Z,8Z)-N-isobutyldodeca-2,4,10-triene-8 -y...O=C(NCC(C)C)\\C=C\\C=C/CCC#C\\C=C/C-3.86
\n", "
" ], "text/plain": [ " Drug_ID \\\n", "0 (-)-epicatechin \n", "1 (2E,4Z,8Z)-N-isobutyldodeca-2,4,10-triene-8 -y... \n", "\n", " Drug Y \n", "0 O1c2c(CC(O)C1c1cc(O)c(O)cc1)c(O)cc(O)c2 -6.22 \n", "1 O=C(NCC(C)C)\\C=C\\C=C/CCC#C\\C=C/C -3.86 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from tdc.single_pred import ADME\n", "data = ADME(name = 'Caco2_Wang')\n", "data.get_data().head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The object `data` is a data loader class, which contains various functions. For example, you can use the `get_data()` function to get the full dataset in Pandas Data Frame. You can also specify `get_data(format = 'dict)` to get a dictionary formatted data where each key is a column name, and each value entry is the column.\n", "\n", "Note that as many dataset names are complicated with cases and stuff, in the backend, TDC supports fuzzy search, so you don't need to worry about typing wrong a few characters!\n", "\n", "You can see that in addition to the ID and labels, we also provide the input feature for the dataset. For example, for drugs, we provide SMILES string, for proteins, amino acid sequence, and so on.\n", "\n", "You can then split the dataset into training, validation and testing pandas data frames, processed in ML ready format:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Drug_IDDrugY
0VLA-4 antagonist 3S1CN(S(=O)(=O)c2cn(nc2)C)[C@H](C(=O)N[C@@H](Cc...-5.17
1AstilbinO1[C@@H](C)[C@H](O)[C@@H](O)[C@@H](O)[C@@H]1O[...-6.82
\n", "
" ], "text/plain": [ " Drug_ID Drug Y\n", "0 VLA-4 antagonist 3 S1CN(S(=O)(=O)c2cn(nc2)C)[C@H](C(=O)N[C@@H](Cc... -5.17\n", "1 Astilbin O1[C@@H](C)[C@H](O)[C@@H](O)[C@@H](O)[C@@H]1O[... -6.82" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "split = data.get_split()\n", "split['test'].head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can specify the splitting method, random seed, and split fractions in the function by e.g. `data.get_split(method = 'random', seed = 1, frac = [0.7, 0.1, 0.2])`. In default, the split `method` is random split, the `seed` is the TDC benchmark seed (`seed = 'benchmark'`), and the fraction `frac` is 70%, 10%, 20% for training, validation and testing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it! That's how you can get to any dataset in the TDC. TDC provides more than 50 datasets across more than 20 tasks. To explore the full datasets, **please visit our [website](https://zitniklab.hms.harvard.edu/TDC/)!** \n", "\n", "In the next set of tutorials, we are going to cover \n", "\n", "* [TDC 102 Data Functions](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_102_Data_Functions.ipynb)\n", "\n", "* [TDC 103 Part 1: Datasets - Small Molecules](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_103.1_Datasets_Small_Molecules.ipynb)\n", "\n", "* [TDC 103 Part 2: Datasets - Biologics](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_103.2_Datasets_Biologics.ipynb)\n", "\n", "* [TDC 104 ML Model Examples with DeepPurpose](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_104_ML_Model_DeepPurpose.ipynb)\n", "\n", "Check them out!\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:DeepPurpose]", "language": "python", "name": "conda-env-DeepPurpose-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 2 }