{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TDC 103: Datasets Part 1 - Small Molecules\n",
"\n",
"[Kexin](https://twitter.com/KexinHuang5)\n",
"\n",
"In this tutorial, we will walk through various small molecule datasets provided in TDC!\n",
"\n",
"We assume you have familiarized yourself with installation, data loaders, and data functions. If not, please visit [TDC 101 Data Loaders](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_101_Data_Loader.ipynb) and [TDC 102 Data Functions](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_102_Data_Functions.ipynb) first!\n",
"\n",
"TDC has more than 60 datasets in the first release. In this tutorial, we highlight many of them to give users a sense of what TDC covers. We start with small molecule drugs and move on to biologics in the next part of the tutorial. For small molecules, we introduce the datasets in the order of the discovery and development pipeline.\n",
"\n",
"## Small Molecule \n",
"\n",
"### Target Discovery\n",
"\n",
"The first stage of small molecule drug discovery is target discovery, that is, identifying genes associated with the disease of interest. This stage is relatively underexplored in ML. One way to approach it is to model it as a gene-disease association (GDA) prediction problem. TDC includes one high-quality GDA dataset, [DisGeNET](https://www.disgenet.org/), which is curated from UniProt, PsyGeNET, Orphanet, the CGI, CTD (human data), ClinGen, and the Genomics England PanelApp. We also provide disease definitions for diseases and amino acid sequences for genes as input features. You can access them via:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 63.9M/63.9M [00:03<00:00, 18.2MiB/s]\n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" Gene_ID Gene Disease_ID \\\n",
"0 1 MSMLVVFLLLWGVTWGPVTEAAIFYETQPSLWAESESLLKPLANVT... C0019209 \n",
"1 1 MSMLVVFLLLWGVTWGPVTEAAIFYETQPSLWAESESLLKPLANVT... C0036341 \n",
"\n",
" Disease Y \n",
"0 Hepatomegaly: Abnormal enlargement of the liver. 0.3 \n",
"1 Schizophrenia: Schizophrenia is highly heritab... 0.3 "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.multi_pred import GDA\n",
"data = GDA(name = 'DisGeNET')\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Gene_ID is the GenBank GeneID and the Disease_ID is the Concept ID from MedGen. We can view the distribution of the association scores with:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAEWCAYAAAB/tMx4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3deXhU5dn48e89ySSTjQQSNgkYNhGrgIKK2FqK9BURtK24VVG6WbV1aaXta9WKr7jUttal1Z9oLS6oULVVFKyKpm5ACbLLKgYIiyEBsm+TeX5/nJOQhEkySWZOZrk/15UrM3OW+54zM/c885xzniPGGJRSSkUeV3cnoJRSqnO0gCulVITSAq6UUhFKC7hSSkUoLeBKKRWhtIArpVSE0gLuIBHJFZEfO72svfw3RGRrZ5f3s76lInKNfXuWiHwcxHVfKSLvBGt9HYh7tohsF5FyEflOJ9fx/0TkzmDnppQ/WsA7QUTyRWRyd+fRQETmiEidiJTZf9tE5C8i0r9hHmPMR8aYEQGu64X25jPGnG+MeTYIueeIiBGR+CbrXmCM+Z+urrsT/g/4izEm1Rjzr5YT7de9yt7GR0TkUxG5TkQaP0fGmOuMMfcEEkxELheRlSJSISKF9u0bREQCWHaivd3+2uLxj0Vkln17lojU219ITf+Oa3HfZz+vhvtX+ok3X0Rqm7zHNorI/SKSHshztdfhyOcm3D6foaQFPHosNMakAb2A7wL9gNVNi3gwiCVa3zfHA5vamWe6vZ2PBx4AfgP8raOBRORW4BHgD1ivVV/gOuBsICHA1VQAV4tIThvzLLe/kJr+7Wt6H9htP6+Gxxa0sq4H7efeG/gBMB74RERSAsxXBVm0fhC7hYj0FJE3ReSgiBy2b2e3mG2oiPxXREpE5HUR6dVk+fF2q+6IiKwTkYkdzcEYU2eM2QRcBhwEbrXXPVFECprE+o2I7LVbU1tF5FwRmQL8FrjMbomts+fNFZF7ReQToBIY4qdLR0TkMft5bRGRc5tMaNYiatHK/9D+f8SOeVbLLhkRmSAiq+x1rxKRCU2m5YrIPSLyif1c3hGRrNa2j4j8RER2iMghEXlDRI6zH/8CGAIstvNIbGc7lxhj3rC38zUicrK9nvkiMte+nWW/B47Y8T4SEZfdav0/4AZjzCvGmDJjWWOMudIYU2MvnygifxSR3SLylVjdM0lN0jgCzAfuaivXYDPGVBtjVgEXAplYxRwRGSoi74tIsYgUicgCEcmwpz0PDOLo9v21/fg/ROSA/dp+KCJfa4gjIlNF5HP7dd0rIrObTJsmImvl6C+hUW3FiVZawIPLBfwdq3U2CKgC/tJinquBHwLHAV7gUQARGQC8BczFakXPBl4Vkd6dScQYUw+8Dnyj5TQRGQH8HDjdblGdB+QbY94G7sNqzacaY0Y3WWwmcC2QBuzyE/JMYCeQhVVQXmv65dSGc+z/GXbM5S1y7YW1XR7FKhYPAW+JSGaT2b6PVUT6YLVeZ+OHiEwC7gcuBfrbz+NlAGPMUJq3RGsCyB1jzH+BAvxsZ6wvzwKsFmtfrC9HA5wFJGK9Pm35PXACMAYYBgwAftdinnuBi+3X1FHGmDLgXY4+d8HavscBI4GBwBx73pk0374P2sssBYZjvXafAU1b/38Dfmq/R08G3gcQkdOAZ4CfYr0nngTeEJHENuJEJS3gQWSMKTbGvGqMqbTf3PcC32wx2/PGmI3GmArgTuBSEYkDrgKWGGOWGGN8xph3gTxgahdS2of1ZdBSPVYBOUlE3MaYfGPMF+2sa74xZpMxxmuMqfMzvRB42P4FsBDYClzQhdwbXABsN8Y8b8d+CdgCTG8yz9+NMduMMVXAIqyC58+VwDPGmM/sAn0bcFY7XRCBaG0712F9URxvb5ePjDX4UBZQZIzxNszY5JdXlYicIyIC/AT4hTHmkP1+ug+4vGkAY8wB4P9htej9GW
+vt+Gvvde5oxqfuzFmhzHmXWNMjTHmINaXbcv3fzPGmGfsXyA1WMV+tBztV6/Deo/2MMYcNsZ8Zj/+E+BJY8xKY0y9vS+mBqtLJ6ZoAQ8iEUkWkSdFZJeIlGJ1D2TYBbrBnia3dwFurA/08cAlTT9swNexCkBnDQAOtXzQGLMDuAXrA1MoIi83dCW0YU870/ea5iOj7cJqiXXVcRzb4t+F9dwaHGhyuxJIDWRdxphyoLjFujrD73bG6t/eAbwjIjtF5H/tx4uBLGm+43aCMSbDnubCarUnY+3HaHg/vG0/3tLvgfNEZLSfaSuMMRlN/oZ29km2ovG5i0gf+720137/v4D13vZLROJE5AER+cKeP9+e1LDMxVgNmF0i8h8ROct+/Hjg1haflYEE5/0WUbSAB9etwAjgTGNMD452DzQ9qmBgk9uDsFoZRVgF8vkWH7YUY8wDnUlErB2N04GP/E03xrxojPk61ofBYBUB7Nt+F2kn5AC71dhgEFbrDKydbclNpvXrwHr32Tk2NQjY285y7a5LrJ1vmZ1cV8M6TscqYsccRmm3LG81xgzBei1+Kda+geVYLcaL2lh1EVYX3NeavB/S7Z2OLeMUAw8DAR39EiwikgpM5uh77H6s13OU/f6/iubv/Zav9fextsFkIB3IaVg1gDFmlTHmIqzulX9h/boC67Nyb4vPSrL968xfnKilBbzz3CLiafIXj9U/XIW1Q64X/ncuXSUiJ4lIMtbP3lfs/uoXgOkicp7dMvGIteOx5U7QNomIW0RGAi9hFcqH/MwzQkQm2Tvqqu2c6+3JXwE50vEjTfoAN9nxL8HqA11iT1sLXG5PGwfMaLLcQcCHtQPRnyXACSLyfRGJF5HLgJOANzuYH8CLwA9EZIz93O8DVhpj8ju6IhHpISLTsPrQXzDGbPAzzzQRGWZ/sZVibeN6Y8wR4G7gcRGZISKpYu3cHAOkABhjfMBTwJ9FpI+9vgEicl4rKT0ETMDa7iEl1s7VsVhF9TDWfh+w3v/lWO//AcCvWiz6Fc1f5zSsL7JirC/4+5rESBDrfIB0u8uuYfuBtV2uE5EzxZIiIheISForcaKWFvDOW4JV+Br+5mC1gpKwWk8rsH7ytvQ81pEDBwAPcBOAMWYPVmvkt1hFbQ/WByDQ1+gyESnHOjLhDawPxVhjzD4/8yZiHQJXZOfRx44L8A/7f7GIfOZn2dasxNoZVYTV9z/DbhmC1dc/FOvDfjdWIQXAGFNpz/+J/XO4WT+mvY5pWL9uioFfA9OMMUUdyK1hXcvsXF4F9ts5Xd7mQsdaLCJlWK/P7ViF8wetzDsceA+rqC0HHjfG5Nq5PAj8Euv5FGIVnSexDkv81F7+N1hdMCvsLob3sH7h+XtupcCDHNsXf5Ycexz46R18zg1+bT/3Q8BzwGpggr0/B6zX9jSgBGvH82stlr8fuMN+nWfb69iF9Qvoc6zPTFMzgXz7uV+H1aLHGJOH1Q/+F6z31A5gVhtxopYYvaCDUkpFJG2BK6VUhNICrpRSEUoLuFJKRSgt4EopFaHi258leLKyskxOTo6TIZvZag+mOsKBk463FlvBRmQ6foazUirKrF69usgYc8xJXI4W8JycHPLy8pwM2czEidb/3FwHYs23guXOciCYUiqqiYi/8YecLeDdLTv7ekpLS7n55lbP7g2aIS7rPIKbb745oPmHDRvGjTfeGMqUlFJRJqYK+OHDuymvqGT1zvr2Zw6aynbniKv0N4yGUkq1LaYKeFVVKrXeJGpP7MoAf4E5FL8SgF7eM9udN2nLknbnUUqplmKqgFdUpOOtd+bM0yL3+0BgBVwppTpDDyNUSqkIFVMF3OczSOyMNBl2HnvsMR577LHuTkOpqBFTXSg6cFf32rFjR3enoFRUiakCrsLLxIYD81vIzc1tNi03N5fJkyfj9Xpxu928++67x0zvyP32Yp133nnU1NTg8Xh4++23j4n9rW99C2MMLpeL999/v9l9n8
/X6byC/Tzau3/VVVdRUFBATk4O8+fP59FHH+W1117jkksuob6+vvH2z372s2Pi5eXl8etf/5o//OEPjB07luLiYu6++27uuusuMjMz2bFjBzfffDOPPPIIw4YNO2b+p556igULFnD11Vfz3HPPtZnnBRdcQEVFBWlpaSxevJjp06dTVlZGeno6r7/+OhdffDHFxcX06dOHRYsW8d3vfpfDhw+TmZlJWloa+fn5DBs2jKeffprZs2eTl5fH+PHjeeCBB7jppptYv349p512Gg899BC///3vWbp0KdOnT+fWW289JpdwE1NdKGlpu0lOyXck1pCqnzOk6ueOxIoFXq91+ci6On+X4wyumhrresbV1dV+Yzf8kmso1i3vR4qCggIA8vPzAXjtNWv47n/84x/NbvszZ84cfD4fd91lXbPk2WefZcOGDY3FeO7cuVRUVDB37ly/8y9YYF27uGnxbk1FhTXceFlZWbP/JSUlABQXW8POFxYWAnD48OHGxxueW8Ovv4YTCVessIYeX79+PQCffWYNfb906VIAFi9e3G5e4cDR8cDHjRtnuvNMzEmTJlHvM5Sf3tr4+90jacsSxg7pyyOPPNLdqYRUw0lNjzzySKstSdU9UlJSGgtlSy1b4Xl5ecyeffQ6CXfddRf3338/tbW1JCYmct9993Hrrbc2Tv/lL3/JQw8dvTDUpEmTeP/990PwLNqWmJjY+OUMkJycTGXl0fM0MjIyOHLkSJvr6K5WuIisNsaMO+bxWCrgZ599MT5jqD37OyGPVRxvXSYw0/uNdudNWfsyaQnCsGHDQp1Wt9qxYwdJSUm88sorWsAjTNPCNW3aNMrLyxvvx8dbPbFer5f4+Hg8Hk+z6SISNfufwq2Ah7wPXESuBa4FGDRoUKjDtammpqdjsYrd1jVuAyngSkWSpsUZjnYxNdxuOT1ainc4CnkBN8bMA+aB1QIPdbxI5PP0YFgMdaGoyJaamhqzLfBwE1M7MZVS/qWkpLQ67ZJLLml2f86cOc3u33777bhcVimJi4vj7rvvbjb9F7/4RbP7kyZN6kKmnZeYmNjsfnJycrP7GRkZTqYTFFrAVbdoqy+x5bRg34/UdYcy1ltvvdXqulseRjhu3DhSU1MBqzX+rW99iylTpiAiTJkyhbFjx9Iw7n9OTg4XXnhhs/l/97vfhex5tLWN/v3vfze7v2RJ8zGI/vWvf3V63d1FC7iKCA0/091ud8hjNbTUPB6P39giAtDY6mx5P1JkZ2cDNBbb733ve4DV4m562585c+bgcrkaW9vXXHMNp5xyCldffTUAd9xxBykpKdxxxx1+57/yyisBGudvS8Ovg7S0tGb/09PTAcjMzASgT58+APTs2bPx8Ybn1nCAwLhx1n7A8ePHAzBq1CgATjvtNADOP/98AKZPn95uXuEgpo5COffc/6Gu3kfF6e2/abrKh3W4kovEduaMncMIG06j13HPleqYbjsKJZx4PG681bWOxAqkcMcaLdxKBVdMFfCKinRq6py5mMPB+GUA9Pae60g8pVTsiakCXlWV6th44Ifd/wW0gCulQiey9roopZRqFFMtcADB4HHgEmauEdYAO0lb249lXROzb4gzUkpFm5gq4PX1g/F6vYwdUhTyWPmeBADGDgmkMPeN+nFQlFLBF1MFPDX1NgCcOFpv3fx1VqxZ0X1ooFKq+zh6HLiIHAR2ORbQvywg9E3wjgvXvCB8c9O8Oi5ccwvXvCA8cjveGNO75YOOFvBwICJ5/g6I727hmheEb26aV8eFa27hmheEd256FIpSSkUoLeBKKRWhYrGAz+vuBFoRrnlB+OameXVcuOYWrnlBGOcWc33gSikVLWKxBa6UUlFBC7hSSkUoLeBKKRWhtIArpVSE0gKulFIRSgu4UkpFKC3gSikVobSAK6VUhNICrpRSEUoLuFJKRSgt4EopFaG0gCulVITSAq6UUhFKC7hSSkUoRy9qnJWVZXJycpwM2czWrdb/ES
MciFVsBRuR6UAwpVRUW716dZG/a2I6WsBzcnLIy8tzMmQzEyda/3NzHYg13wqWO8uBYEqpqCYifi8Gr10oSikVobSAK6VUhNICrpRSEcrRPvDutmSJg7GudDCYUhGirq6OgoICqquruzuVsOTxeMjOzsbtdgc0f0wV8ORkB2O5HQymVIQoKCggLS2NnJwcRKS70wkrxhiKi4spKChg8ODBAS0TU10ojz9u/TkSa9XjPL7KoWBKRYjq6moyMzO1ePshImRmZnbo10lMtcAXLYLC0moyxhY2e/z7Zw4KfqxNiwC44fQbgr5upSKZFu/WdXTbxFQLXCmlookWcKVUTBERZs6c2Xjf6/XSu3dvpk2b1qH1TJw4sfHExKlTp3LkyJGg5hmImOpCUUqplJQUNm7cSFVVFUlJSbz77rsMGDCgS+tc4uQhbk1oC1wp1W0mTjz2r+FAg8pK/9Pnz7emFxUdOy1Q559/Pm+99RYAL730EldccUXjtIqKCn74wx9y+umnc+qpp/L6668DUFVVxeWXX86oUaO47LLLqKqqalwmJyeHoqIiAL7zne8wduxYvva1rzFv3rzGeVJTU7n99tsZPXo048eP56uvvgo84VbEVAHPzYU7nihsd76gxJqVq+OgKBWmLr/8cl5++WWqq6tZv349Z555ZuO0e++9l0mTJrFq1So++OADfvWrX1FRUcETTzxBcnIy69ev5/bbb2f16tV+1/3MM8+wevVq8vLyePTRRykuLgasL4bx48ezbt06zjnnHJ566qkuPw/tQlFKdZu2BpZLTm57elZW5wemGzVqFPn5+bz00ktMnTq12bR33nmHN954gz/+8Y+Adejj7t27+fDDD7npppsalx81apTfdT/66KP885//BGDPnj1s376dzMxMEhISGvvZx44dy7vvvtu55JuIqQL+xz/Cmt1pXHBlWehjfWq9+LMnzA55LKVUx1144YXMnj2b3NzcxlYyWCfUvPrqq4zwM+50e4f55ebm8t5777F8+XKSk5OZOHFi43Hdbre7cfm4uDi8Xm+Xn0NMdaG8+Sas+TjJmVjb3uTNbW86Eksp1XE//OEP+d3vfscpp5zS7PHzzjuPxx57DGMMAGvWrAHgnHPOYcGCBQBs3LiR9evXH7POkpISevbsSXJyMlu2bGHFihUhfQ4BF3ARiRORNSLypn1/sIisFJHtIrJQRBJCl6ZSSgVXdnY2N9988zGP33nnndTV1TFq1ChOPvlk7rzzTgCuv/56ysvLGTVqFA8++CBnnHHGMctOmTIFr9fLqFGjuPPOOxk/fnxIn0NHulBuBjYDPez7vwf+bIx5WUT+H/Aj4Ikg56eUUkFVXl5+zGMTJ05kon0YS1JSEk8++eQx8yQlJfHyyy/7XWd+fn7j7aVLl7Ybd8aMGcyYMaMDWfsXUAtcRLKBC4Cn7fsCTAJesWd5FvhOl7NRSikVsEC7UB4Gfg347PuZwBFjTEMvfAHg90h4EblWRPJEJO/gwYNdSrarkpLAnWicieVOIsntTH+7Uio2tduFIiLTgEJjzGoRmdjwsJ9Z/VZGY8w8YB7AuHHjnKmerVi6FF5c6cyXyNIr/f+MUkqpYAmkD/xs4EIRmQp4sPrAHwYyRCTeboVnA/tCl6ZSSqmW2u1CMcbcZozJNsbkAJcD7xtjrgQ+ABp64a8BXg9ZlkFyzz3wz2d6tD9jMGL95x7u+c89jsRSSsWmrhwH/hvglyKyA6tP/G/BSSl0li2DTas8zsT6chnLvlzmSCylVGzqUAE3xuQaY6bZt3caY84wxgwzxlxijKkJTYpKKRU8wRpONhzE1JmYSinVdDhZICjDyXaXmBoLRSkVXibOn3jMY5d+7VJuOP0GKusqmbpg6jHTZ42ZxawxsyiqLGLGouYnwwQ6AmjDcLIzZsxoHE72o48+AqxRA2+88UY2bNiA1+tlzpw5XHTRReTn5zNz5kwqKioA+Mtf/sKECRPIzc1lzpw5ZGVlsX
HjRsaOHcsLL7zgyKXjYqoFnpkJqem+9mcMRqzkTDKTMx2JpZTqmM4MJ9unTx/effddPvvsMxYuXNg4MiFY46U8/PDDfP755+zcuZNPPvnEkecRUy3wV1+FF1cWORPr0lcdiaNUJGurxZzsTm5zelZyVqfH3O/McLLHHXccP//5z1m7di1xcXFs27atcZkzzjiD7OxsAMaMGUN+fj5f//rXO5VbR8RUAVdKqQYdHU52zpw59O3bl3Xr1uHz+fB4jh7RlpiY2Hg7WEPFBiKmulBuuw1efjzdmVjv3cZt793mSCylVMd1dDjZkpIS+vfvj8vl4vnnn6e+vt7xnFuKqQK+fDns2JDY/ozBiFWwnOUFyx2JpZTquI4OJ3vDDTfw7LPPMn78eLZt20ZKSorTKR9DGr5lnDBu3DiTl5fnWLyWJk6EwtLqY66L+f0zBwU/lr13Xa+LqdRRmzdvZuTIkd2dRljzt41EZLUxZlzLeWOqBa6UUtFEC7hSSkWomDoKJTsbvEXO7HjI7pHtSBylIo0xxpGTXCJRR7u0Y6qAv/ACvLiyuP0ZgxHrey84EkepSOLxeCguLiYzM1OLeAvGGIqLi5sdntiemCrgSqnulZ2dTUFBAd19da5w5fF4Gk8ICkRMFfBbboGtBzKY+YsjoY/19i0APDzl4ZDHUipSuN1uBg8e3N1pRI2YKuBr10JhaYIzsQ6sdSSOUip26VEoSikVobSAK6VUhNICrpRSESqm+sBPOAFchc6MEnZC5gmOxFFKxa6YKuDz5sGLKw85E2v6PEfiKKVil3ahKKVUhIqpAn7ttfD0/b2cibX4Wq5dfK0jsZRSsSmmulC2bYPCUmee8rbibe3PpJRSXRBTLXCllIomWsCVUipCaQFXSqkIFVN94GPGwNYDtc7E6jfGkThKqdgVUwX84YfhxZWhH4kQdBRCpVToaReKUkpFqHYLuIh4ROS/IrJORDaJyN3244NFZKWIbBeRhSLizDitXXDVVfD4XZnOxHrtKq567SpHYimlYlMgLfAaYJIxZjQwBpgiIuOB3wN/NsYMBw4DPwpdmsFRUACHCuOciVVaQEFpgSOxlFKxqd0Cbizl9l23/WeAScAr9uPPAt8JSYZKKaX8CqgPXETiRGQtUAi8C3wBHDHGNAztVwAMaGXZa0UkT0Ty9Dp4SikVPAEVcGNMvTFmDJANnAGM9DdbK8vOM8aMM8aM6927d+czVUop1UyHDiM0xhwRkVxgPJAhIvF2Kzwb2BeC/ILqrLNg074aZ2Jln+VIHKVU7Gq3gItIb6DOLt5JwGSsHZgfADOAl4FrgNdDmWgw3H8/vLiyxJlYk+93JI5SKnYF0gLvDzwrInFYXS6LjDFvisjnwMsiMhdYA/wthHkqpZRqod0CboxZD5zq5/GdWP3hEePii2HPoSxueaAo9LEWXQzAq5e+GvJYSqnYFFOn0hcXQ3mpMyefFlcWOxJHKRW79FR6pZSKUFrAlVIqQmkBV0qpCBVTfeDnngvrC6qdiTX4XEfiKKViV0wV8DvvhBdXljoT65t3OhJHKRW7tAtFKaUiVEwV8PPPh9/f4sx4LOcvOJ/zF5zvSCylVGyKqS6UqiqoqxFnYtVVORJHKRW7YqoFrpRS0UQLuFJKRSgt4EopFaFiqg982jRYs9uZvulpJ0xzJI5SKnbFVAGfPRteXFnmTKwJsx2Jo5SKXdqFopRSESqmCvjEiTD3+j7OxJo/kYnzJzoSSykVm2KqgCulVDTRAq6UUhFKC7hSSkUoLeBKKRWhYuowwksvhVVfVjoT62uXOhJHKRW7YqqA33ADvLiy3JlYp9/gSBylVOyKqS6UykqoqXZmNMLKukoq65xp7SulYlNMFfCpU+EPv3BmPPCpC6YydcFUR2IppWJTTBVwpZSKJlrAlVIqQmkBV0qpCKUFXCmlIlRMHUY4axYs/6LCmVhjZjkSRykVu9ot4CIyEHgO6Af4gHnGmEdEpBewEMgB8oFLjT
GHQ5dq182aBQkrtYArpaJDIF0oXuBWY8xIYDzwMxE5CfhfYJkxZjiwzL4f1oqKoOyIM71GRZVFFFUWORJLKRWb2q1mxpj9xpjP7NtlwGZgAHAR8Kw927PAd0KVZLDMmAGP3JblTKxFM5ixaIYjsZRSsalDzVERyQFOBVYCfY0x+8Eq8oDfKyWIyLUikicieQcPHuxatkoppRoFXMBFJBV4FbjFGFMa6HLGmHnGmHHGmHG9eztzFqRSSsWCgAq4iLixivcCY8xr9sNfiUh/e3p/oDA0KSqllPKn3QIuIgL8DdhsjHmoyaQ3gGvs29cArwc/PaWUUq0J5Djws4GZwAYRWWs/9lvgAWCRiPwI2A1cEpoUg+f66+Hj7c4MJ3v9uOsdiaOUil3tFnBjzMdAa2OwnhvcdELrssugfqUzQ7xedvJljsRRSsWumDqVfs8eKP4qzplYJXvYU7LHkVhKqdgUUwV85kx4Yk6mM7H+OZOZ/5zpSCylVGyKqQKulFLRRAu4UkpFKC3gSikVobSAK6VUhIqp8cBvvRX+s7XMmVhn3epIHKVU7IqpAj59OuztUQq4Qx9rxPSQx1BKxbaY6kJ5dmkx97ywh+1fhb4VvrVoK1uLtoY8jlIqdsVUC/xXt8RTWnUK/xmxgeF900Ia66dv/hSA3Fm5IY2jlIpdMdMC37SvhNIqL+44FzuLKth7uKq7U1JKqS6JmQL+t4+/JM4l9E/3kBjv4sPtenEJpVRki4kCXlhazeJ1++idlki8SzhjcC827i3hUEVtd6emlFKdFhMF/PkVu/D6DP16eACYMDQLlwif7NCLDiulIlfU78SsrqtnwcrdTB7Zl8vOjeP9zUWkJ7kZPTCdvF2HOPdEv5fy7LI7zrkjJOtVSqkGUV/Al20u5FBFLT+YkMOEYVCYVgPAWUOz+Gz3EbYcCM0hhZOHTA7JepVSqkHUd6Fs3FdCvEsYl9OLtWshf5t1Ek//dA8J8S4KjoTmaJS1B9ay9sDa9mdUSqlOivoCvmV/KcP6pJIQ7+KWW+CFP/cEwCXCgIwk9h4OzRV6bnn7Fm55+5aQrFsppSAGCvjm/WWM7N/D77QBGUnsL6mmrt7ncFZKKdV1UV3AD1fUcqC0mhP7+QjzOgwAABU3SURBVD/rckDPJLw+wzYHTq1XSqlgi+oCvvlAKUCrLfDsjCQA1heUOJaTUkoFS1QX8C37rZb1if39t8B7pSTgcbu0gCulIlJUH0a45UApWakJ9EmzTuC57z54Z9ORxukiQnZGMhv2HmltFZ1237n3BX2dSinVVFQX8M37yzix39HukwkTID+u+enzA3om8cmOIqrr6vG444IWe8LACUFbl1JK+RO1XSjeeh/bvipjZJPuk08/hW3rE5rNNyDD2pEZ7BN6Pt3zKZ/u+TSo61RKqaaitoDnF1dQ4/U1a4H/9rew6ImMZvNl97R2ZG4oCG43ym+X/ZbfLvttUNeplFJNRW0B39zODswG6UluslITWKc7MpVSESaKC3gp8S5hWJ/UNucTEU4ZkM4GLeBKqQgTtQV8y4EyhvZOJTG+/R2Tp2RnsL2wjMparwOZKaVUcLRbwEXkGREpFJGNTR7rJSLvish2+3/P0KbZcVv2lzbbgdmW0dnp+Axs2lca4qyUUip4AmmBzwemtHjsf4FlxpjhwDL7ftg4UlnLvpJqTmxxBubDD8NVvzh8zPynZKcDsG5P8HZkPjzlYR6e8nDQ1qeUUi21exy4MeZDEclp8fBFwET79rNALvCbIObVJQ2HBLY8hX7MGPi8pu6Y+fukeeif7gnqGZlj+o0J2rqUUsqfzvaB9zXG7Aew/7d6WRsRuVZE8kQk7+BBZy4kvHm/1RXSchCr996Djf9N9LvMqOx01gfxUML3dr7HezvfC9r6lFKqpZDvxDTGzDPGjDPGjOvdu3eowwGw7asyeia76ZPWvFjPnQv/+nu632VGZWeQX1xJSeWxLfTOmPvhXOZ+ODco61JKKX
86W8C/EpH+APb/wuCl1HVbDpQxol8aIhLwMqOzrRN81odgXBSllAqFzhbwN4Br7NvXAK8HJ52u8/kM2w6UMaJvYEegNGjYkakjEyqlIkUghxG+BCwHRohIgYj8CHgA+LaIbAe+bd8PC3uPVFFRW8+Ifv7HAG9NepKbwVkpQT0SRSmlQimQo1CuaGXSuUHOJSgajkAZ0cpVeNoyKjudlTsPBTslpZQKiagbTrbh8mgn9D32FPonn4TF61ov0KOyM3h97T4KS6vp08PTpTyenPZkl5ZXSqn2RN2p9FsOlDEgI4k0j/uYaSNGwHHHt366/OiGE3qC0A8+ImsEI7JGdHk9SinVmqgr4FsPlLZ6EePFi+Gzj5JaXfZrx6UT55Kg9IMv3rqYxVsXd3k9SinVmqjqQqn1+th5sILJI/v6nf6nP0FhaRqnfaPK7/SkhDiG90llXRBO6PnT8j8BMH3E9C6vSyml/ImqFvgXB8vx+kyndmA2GJ2dwYa9JRhjgpiZUkoFX1S1wBt2YHalgI8amM7CvD3sPlTJ8ZkpwUqtTS+u3O338e+fOciR+EqpyBRVLfAtB8qIdwlDstq+iENbGs7I1Cv0KKXCXVQV8K32RRwS4jv/tEb0SyM1MZ7lXxQFMTOllAq+qOpC2XqgjLHHt35tieefh3+tKW5zHe44F988oTfLNhfi8xlcrsDHU2kW67vPd2o5pZQKVNS0wMuq69h7pKrN/u+BAyGzb32765p8Uh8Ky2rYsLfz3SgD0wcyMH1gp5dXSqn2RE0Bb9yB2cYgVgsXwvJ3k9td18QT+uASeG/zV53OZ+HGhSzcuLDTyyulVHuipoAHMgbKE0/Astfa38HZMyWBcTm9eG9z50fJfSLvCZ7Ie6LTyyulVHuipoCv3nWYnslusnu2fqZlR3x7ZF827y+l4HBlUNanlFLBFhUF3BjDpzuKmTA0q0MXcWjL5JOsszmXdaEVrpRSoRQVBXxnUQUHSquZMCwzaOscnJXCkN4pXeoHV0qpUIqKAv7pDuuY7bOHZgV1vd8e2ZcVO4spqw7OdTKVUiqYouI48E92FDMgI4njM9s+wuSVV+DV1YGfoDP5pL48+eFO/rPtINNGHdehnF659JUOza+UUh0V8S3wep9h+c5iJgzNbLf/OysL0jJ8Aa/7tEE96ZOWyPxP8js8uFVWchZZycH9RaCUUk1FfAH/fF8pJVV1nD2s/WI5fz78583AB6iKcwm3TD6BvF2H+femjvWFz187n/lr53doGaWU6oiI70L5xB6zZMLQ9ndgzp8PhaUpfHNaRbvzNowQWO8z9E5L5PZ/buBgWQ0zzzo+oLwaivesMbMCml8ppToq4gv4p18UM7xPapeuYdnacK5gtcLPP7kfzy3fxX+/LA64gAeixlvPPz/by4srd9Gnh4fsnklk90wmNTHiXxallAMiulLUen2s+vIQl50e2jFHRvRNY0hWCsu2FFJaXUcPP9fb7Iiq2nqeX5HP0x99SWFZDWmeeDbtK8UALoFvn9SPc4Zr/7lSqm0RXcDX7D5MVV19QN0nXSEinH9Kf/76wQ7mvLGJP8wYTVwnRyn8sqiC619YzZYDZUwYmsmfLh3N7uJKaut97DtSzfIvivj3pgPsOVTJhWOO83txZqWUggjfifn+1kJcAmcOCW0BBxiQkcSkE/vw2md7uemlNdR42x/VsKV3Nh3gwsc+5kBpNX//wem8+JPxfGN4b0SExPg4BmelcMUZg5h6cj+2HCjlor98wpdF7ffXK6ViU8S2wAtLq3nu012cf3J/0pMCa6UuWQILVx3sdMzJI/syYWgmc9/azJGqWp6cOa7V/uolVy5pvF1SVccf/72V51fsYlR2Oo9feRrZPf0fsy4ifH14bwb0TObVzwq4+IlP+fus0xk9MKPTeXeWv30Depk3pcJHxBbwP72zDa/Px6+njAh4meRkSPR07WLFP/7GEHqlJPCrV9Zz7p9yuWZCDleecTzpyc2/RJ
LdyRhj+OeaAu59azOHKmr5wdk5/GbKiXjcce3GGZyVwivXncXVz/yXK55awRNXjeWbJ/TuUu7RIhy/WPS6pqo7RGQB37y/lEWr9/DDswd36MLDjz8Oq75M5dszyrsU/3unZTOwVzKPLtvOg29v5bFlO5g4ojeDeiWT3SuZmrp6ntswj93FlVBxHqMHZjD/B2dw8oD0DsUZ0juV166fwDV/X8WP5q/iV+eN4MffGNLp/nelVHSJyAJ+35LN9PC4uXHSsA4tt2gRFJYmd7mAA5ye04vnf3Qmm/eX8szHX5K36zDLNhdSW2+d6Xk4+W3SPPE8Nu3XXDR6QKcvzdanh4eFPx3P7EXruH/pFt7edIA/XjKaob07f+Hm9tT7DHsOVbLzYDml1XVU1flI98STnpxASWXdMb82VHPGGPuIIv2iVaEVcQX8vc+/4qPtRdxxwUgykhMcj+/vp/IfLhkNgM9nKCyrweWCS1/9A4WlNVTV+nh51Z7GeTvzk7qHx82TM8fyxrp9/O71TUx95CNmjM3mknEDGZ2dHpQhdPccquSj7UV8sqOIT74o4kil/wG8/vrBDk7om8qEoVmMH5LJ+CG9Qv46GGP4sqiCdQVH2HqgnA+2FFJSVYfLBXH2DuCdB8s5eUA6Jw9IZ2jvlKANK9yeihov/80/xNIN+9lfUk1JVR0lVXXU1ftITojjueX5HJeRxKjsdE4d1JMx2Rn6BaiCpksFXESmAI8AccDTxpgHgpKVH+U1Xh56ZxvzP/2SIb1TuPqsnFCF6jB/Rb2wtCaoMUSEi8YM4Kwhmfz+7a28srqABSt3M7xPKl8fnsXIfj0Y2b8H2T2TSE9yt9nir6jxsqOwnO2F5Xy2+zAfby9i9yHrwhX9eniYPLIvZ+T0YsuBMtKT3HjcLkqrvBypquW4jCRWfnmIhav2MP/TfETgxH49GD+kFyf178GIfmkM65NKckLn31qVtV427y9l495SVu86zIqdxRSWWdvTHSdkpiTSM9mNz4DPGCpr63luxS5qvdavn14pCZyR04szBvfilOx0TuyXFrTDMWu89azbU2J90e0oYu2eI3h9hjiX0D/dQ58eiZzQN5WEeBflNfVkJLvZXVzJB1sLaRhOZ0jvFMYMzGDMwAyG9U5lcO8U+qZ5Ov0rrS0vrtyNzxi89QafMcTHCTPHH+/YF5wKLenoIE2NC4rEAduAbwMFwCrgCmPM560tM27cOJOXl9fhWEs27OfuxZsoLKvhijMG8ZvzTuxUK2biROvolTueCP1FGuauvAyAO84M7nUxG1rwpdV1vLV+P/9as5c1u480dt2AdTJQckI8WakJuONcuONceH0+KmrqKauuo7Ta2zhvamI844f04uvDsvj68CyG9k5t/HC3dYaq1+ej4FAVO4sq2FlUzt7DVdR4j+aQnuQmKzWBzNRE0hLj8STEkeyOwx3vIt4lxLkEY6C23oe33seRyjoOltdQWFrD/pIqfPbbsk9aot3Sz2RcTk8GZ6Xwj7yCY/Kp9xkOltVQcLiS/OIKviyq4HCTXxGDeiWT3TOJfuke+vXw0CPJTUpiPKmJcSTExREfJ7jjrJy8PoPPZyir8VJSWcfhylp2Hapk24EyviyqwOsziMCoAemcNTSLs4dl8kVhBQnxxx6V2/B6lVXXsaGghDV7jvDmun3sPlxFRc3R1yHeJfRL95CVmkhmSgKpnng7v3gS410kxLlIiHcR5xJEBJdYz7nG66PW66Oy1ktplZeymjoOV1g5H6qo5XBlLXX1zT/jLoGUxHh6pyXSOzWR3mmJ9O3hoW+PRPqkechIdtMzOYH0JDdJCXF44uNIdB993doq/sYYfMZ6f9TVG2rq6qmt91FZW09VbT0VNV4qa+vtPy/VXuv1r6v3YYx15nOcS4iPc+GJd5HojiMx3oXHHdd43x0nuOOO5uOSo/9dAggIR3M0GIz9Ze+tN9TWW9usqs7Kp+Fzkbv1INXeeuq8hjqflZcx1mUa41
2Cxx1HamI8aZ54eiS5SU9y0yPJTZonnpSEeJIS4khyx9mfuba3U0eJyGpjzLhjHu9CAT8LmGOMOc++fxuAMeb+1pbpbAG/YcFq8osqufe7J3PqoJ6dyheio4D74zOGQxW1HLB/wlfUeqmo8VLr9VHvM3h9BpcIHreLhPg40jzx9E2zPqw9UxKIc4nfrp22CnhL9T4rh69KqzlYXkNZdR3lNfWUV3up9dZTW2+oq7fy8RlDvZ2TyyXECXjccfTwuEn1xNMrJYEBGUkcl5FED098pz8IpVV17C+pYl9JdeO2Ka2qo7zGi9cX+Ps+ziUMyEjihL5pjOiXypHKOoZkpZKU0P7RRK0xxlBa7aWovIaDZTUcqqilvMZ63cprvNR4fXZxrsdbb/WptyXJHUePpHjSPG4yktz0SkmgV0oCBYerSLC/AESsL6cT+6VRVu3loB27sLSar0prqKoL7NwGl1i/CAUQAWPAcLR4R7o4l/VlHu+ytllCnAuvz1BVV9/4Ky/Q9cSJgFjb7M0bv8GwPp3bdxWKAj4DmGKM+bF9fyZwpjHm5y3muxa41r47AtjaqYDBkwUEPii4c8I1Lwjf3DSvjgvX3MI1LwiP3I43xhxzHHFX+sD9NYuO+TYwxswD5nUhTlCJSJ6/b7LuFq55Qfjmpnl1XLjmFq55QXjn1pVT6QuApqNIZQP7upaOUkqpQHWlgK8ChovIYBFJAC4H3ghOWkoppdrT6S4UY4xXRH4O/BvrMMJnjDGbgpZZ6IRNd04L4ZoXhG9umlfHhWtu4ZoXhHFund6JqZRSqntF9HCySikVy7SAK6VUhIrKAi4iU0Rkq4jsEJH/9TP9OhHZICJrReRjETkpXHJrMt8METEi4sjhSwFss1kictDeZmtF5MdO5BVIbvY8l4rI5yKySUReDIe8ROTPTbbXNhE54kReAeY2SEQ+EJE1IrJeRKaGSV7Hi8gyO6dcEcl2KK9nRKRQRDa2Ml1E5FE77/UicpoTebXLGBNVf1g7VL8AhgAJwDrgpBbz9Ghy+0Lg7XDJzZ4vDfgQWAGMC4e8gFnAX8L09RwOrAF62vf7hENeLea/EWtHf7hss3nA9fbtk4D8MMnrH8A19u1JwPMObbNzgNOAja1MnwosxTr/ZTyw0om82vuLxhb4GcAOY8xOY0wt8DJwUdMZjDGlTe6m4OcEpO7KzXYP8CBQHWZ5dYdAcvsJ8FdjzGEAY0zox0ro+Da7AnjJgbwgsNwM0MO+nY4z53AEktdJwDL79gd+poeEMeZD4FAbs1wEPGcsK4AMEenvRG5ticYCPgDY0+R+gf1YMyLyMxH5AqtQ3hQuuYnIqcBAY8ybDuUUUF62i+2fj6+IyEA/00MhkNxOAE4QkU9EZIU9SmY45AVY3QLAYOB9B/KCwHKbA1wlIgXAEqxfCOGQ1zrgYvv2d4E0EQn9RW/bF/Dr7aRoLOCBnuL/V2PMUOA3wB0hz8rSZm4i4gL+DNzqUD6Nof081nKbLQZyjDGjgPeAZ0OelSWQ3OKxulEmYrV0nxaRUF9ENKD3me1y4BVjTMevhN05geR2BTDfGJON1T3wvP3+6+68ZgPfFJE1wDeBvYD3mKWc15HX2zHRWMA7eor/y8B3QprRUe3llgacDOSKSD5WX9sbDuzIbHebGWOKjTENg5w/BYwNcU4B52bP87oxps4Y8yXWgGnDwyCvBpfjXPcJBJbbj4BFAMaY5YAHa9Cmbs3LGLPPGPM9Y8ypwO32YyUhzisQ4Tl0SHd3wodgZ0Q8sBPrJ2vDjpKvtZhneJPb04G8cMmtxfy5OLMTM5Bt1r/J7e8CK8JlmwFTgGft21lYP3Uzuzsve74RQD72SXNhtM2WArPs2yOxilFIcwwwryzAZd++F/g/B7dbDq3vxLyA5jsx/+tUXm3m3N0JhOiFmIp1sYkvgNvtx/4PuNC+/QiwCViLtaOk1SLqdG4t5nWkgAe4ze63t9k6e5udGC7bzP5QPQR8DmwALg
+HvOz7c4AHnNpWHdhmJwGf2K/nWuB/wiSvGcB2e56ngUSH8noJ2A/UYbW2fwRcB1zX5D32VzvvDU59Ltv701PplVIqQkVjH7hSSsUELeBKKRWhtIArpVSE0gKulFIRSgu4UkpFKC3gKmbZI8x9LCLnN3nsUhF5uzvzUipQehihimkicjLWCHinYo2WtxaYYoz5olsTUyoAWsBVzBORB4EKrJEpy4wx93RzSkoFRAu4inkikgJ8BtRinWFX084iSoWFTl+VXqloYYypEJGFQLkWbxVJdCemUhaf/adUxNACrpRSEUoLuFJKRSjdiamUUhFKW+BKKRWhtIArpVSE0gKulFIRSgu4UkpFKC3gSikVobSAK6VUhNICrpRSEer/A6lUjN7wrfXcAAAAAElFTkSuQmCC\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"data.label_distribution()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, you can build ML models to predict this association. Another way to frame the task is as missing link prediction in a gene-disease association network, where recent graph ML methods can be applied. You can obtain the network as an edge list, DGL, or PyG object using TDC data functions. For example, suppose we want to include all associations above 0.35 as edges. To obtain a DGL object, type:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The dataset label consists of affinity scores. Binarization using threshold 0.35 is conducted to construct the positive edges in the network. Adjust the threshold by to_graph(threshold = X)\n",
"Using backend: pytorch\n"
]
},
{
"data": {
"text/plain": [
"DGLGraph(num_nodes=9432, num_edges=15484,\n",
" ndata_schemes={}\n",
" edata_schemes={})"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"graph = data.to_graph(threshold = 0.35, format = 'dgl', split = True, frac = [0.7, 0.1, 0.2], seed = 'benchmark', order = 'ascending')\n",
"graph['dgl_graph']"
]
},
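{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the thresholding step concrete, the following is a minimal plain-Python sketch on made-up rows; it does not call TDC. The identifiers, scores, and the helper name `binarize_edges` are hypothetical, and keeping only scores strictly above the threshold is an assumption about how the comparison is done."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy (gene, disease, score) rows; all values are made up for illustration.\n",
"associations = [\n",
"    ('g1', 'd1', 0.30),\n",
"    ('g1', 'd2', 0.30),\n",
"    ('g2', 'd3', 0.50),\n",
"    ('g3', 'd4', 0.40),\n",
"]\n",
"\n",
"def binarize_edges(rows, threshold):\n",
"    # Assumption: an association becomes an edge when score > threshold (strict).\n",
"    return [(gene, disease) for gene, disease, score in rows if score > threshold]\n",
"\n",
"edges = binarize_edges(associations, threshold=0.35)\n",
"print(edges)"
]
},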
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to predicting GDA, there is also research on target fishing using drug-target interaction datasets.\n",
"\n",
"### Activity\n",
"\n",
"After we have found the target, we want to screen a large set of compounds to identify those that have high binding affinity or activity against the disease target. The binding affinity is measured via high-throughput screening (HTS). A huge amount of wet lab data is available for various disease targets. Instead of including all of it, TDC aims to include assays for diseases of current interest. For example, we include SARS-CoV-2 in vitro data from Touret et al.:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 101k/101k [00:00<00:00, 626kiB/s] \n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" Drug_ID Drug Y\n",
"0 0 CCOC1=CC2=C(C=C1)N=C(S2)S(=O)(=O)N 1\n",
"1 1 C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@]2(C)O)CC[C@@... 1"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.single_pred import HTS\n",
"data = HTS(name = 'SARSCoV2_Vitro_Touret')\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For HTS, we hope this will be a community-driven effort, where domain experts point out diseases of interest and the corresponding assay data so that we can quickly add them to TDC. This would make TDC reflect the state-of-the-art landscape of disease targets and allow machine learning scientists to build models that aid drug development for those diseases. If you have any ideas, please don't hesitate to [contact us](mailto:kexinhuang@hsph.harvard.edu).\n",
"\n",
"While an HTS dataset is restricted to one target protein, a drug-target interaction (DTI) dataset combines many assays. One major advantage is that an ML model trained on an HTS dataset can only make predictions for one protein, whereas an ML model trained on a DTI dataset learns from both target proteins and drug molecules and can thus generalize to unseen drugs/targets. TDC includes several DTI datasets, including the largest, BindingDB. Note that BindingDB is a collection of many assays. Since different assays report different affinity measures, TDC provides them as separate datasets. Specifically, there are four datasets, with Kd, IC50, EC50, and Ki as the measured quantity. We load Kd here as an example (although IC50 has a much larger number of data points, ~1 million):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 54.4M/54.4M [00:03<00:00, 16.5MiB/s]\n",
"Loading...\n",
"--- Dataset Statistics ---\n",
"10665 unique drugs.\n",
"1413 unique targets.\n",
"66444 drug-target pairs.\n",
"--------------------------\n",
"Done!\n"
]
}
],
"source": [
"from tdc.multi_pred import DTI\n",
"data = DTI(name = 'BindingDB_Kd', print_stats = True)"
]
},
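{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to check generalization to unseen drugs is to hold out whole drugs rather than individual drug-target pairs. The following is a minimal plain-Python sketch of such a split on made-up triples; it is not TDC's own split implementation, and the names `pairs` and `cold_drug_split` are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"# Toy (drug, target, label) triples; names and values are made up for illustration.\n",
"pairs = [\n",
"    ('drug_a', 'target_1', 5.2),\n",
"    ('drug_a', 'target_2', 7.1),\n",
"    ('drug_b', 'target_1', 6.0),\n",
"    ('drug_c', 'target_2', 8.3),\n",
"    ('drug_d', 'target_1', 4.9),\n",
"]\n",
"\n",
"def cold_drug_split(rows, test_frac, seed):\n",
"    # Hold out whole drugs so that no test drug ever appears in training.\n",
"    drugs = sorted({drug for drug, _, _ in rows})\n",
"    rng = random.Random(seed)\n",
"    rng.shuffle(drugs)\n",
"    n_test = max(1, int(len(drugs) * test_frac))\n",
"    test_drugs = set(drugs[:n_test])\n",
"    train = [row for row in rows if row[0] not in test_drugs]\n",
"    test = [row for row in rows if row[0] in test_drugs]\n",
"    return train, test\n",
"\n",
"train, test = cold_drug_split(pairs, test_frac=0.25, seed=0)\n",
"print(len(train), len(test))"
]
},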
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another way to find compounds with affinity to a disease target is through molecule generation. A molecule generation model is roughly defined as a generative model that produces new molecular structures with some desirable property, such as high binding affinity to a target. There are three main paradigms: 1) goal-oriented learning, where the ML model generates new molecules that individually achieve high scores under an oracle; 2) distribution learning, which aims to learn the distribution of the training set and generate molecules from the learned distribution; 3) pair molecule generation, which formulates generation as a translation problem from drug X to drug Y, where X and Y are similar but X has a low score and Y has a high score.\n",
"\n",
"The dataset for paradigms 1 and 2 can be any compound library. We provide several compound libraries and oracles in TDC. For compound libraries, we have MOSES, ChEMBL and ZINC. For example, to load MOSES:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 75.3M/75.3M [00:04<00:00, 18.1MiB/s]\n",
"Loading...\n",
"There are 1936962 molecules \n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" smiles\n",
"0 CCCS(=O)c1ccc2[nH]c(=NC(=O)OC)[nH]c2c1\n",
"1 CC(C)(C)C(=O)C(Oc1ccc(Cl)cc1)n1ccnc1"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.generation import MolGen\n",
"data = MolGen(name = 'MOSES', print_stats = True)\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the same library, different tasks for goal-oriented and distribution learning are defined by different oracles. For example, for goal-oriented learning, one task has an oracle that measures the affinity to the target DRD2, another has an oracle that measures the affinity to the target GSK3B, and so on. We use GSK3B as the example here:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 27.8M/27.8M [00:01<00:00, 16.2MiB/s]\n"
]
},
{
"data": {
"text/plain": [
"[0.05, 0.0]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc import Oracle\n",
"oracle = Oracle(name = 'GSK3B')\n",
"oracle(['CCOC1=CC(=C(C=C1C=CC(=O)O)Br)OCC', \n",
" 'CC(=O)OC1=CC=CC=C1C(=O)O'])"
]
},
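{
"cell_type": "markdown",
"metadata": {},
"source": [
"An oracle can be treated as any callable that maps a list of SMILES strings to a list of scores, as in the call above. Under that assumption, the selection step of goal-oriented learning reduces to scoring candidates and keeping the best ones. The sketch below uses a hypothetical stand-in, `toy_oracle`, whose score has no chemical meaning, so that it runs without downloading anything; `top_k` is likewise an illustrative helper, not a TDC function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def toy_oracle(smiles_list):\n",
"    # Hypothetical stand-in for a real oracle: fraction of characters that are N or n.\n",
"    # This score is for illustration only and carries no chemical meaning.\n",
"    return [s.upper().count('N') / len(s) for s in smiles_list]\n",
"\n",
"def top_k(candidates, oracle, k):\n",
"    # Score every candidate once, then keep the k highest-scoring ones.\n",
"    scores = oracle(candidates)\n",
"    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)\n",
"    return ranked[:k]\n",
"\n",
"candidates = ['CCO', 'CCN', 'NCCN', 'c1ccncc1']\n",
"best = top_k(candidates, toy_oracle, k=2)\n",
"print(best)"
]
},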
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For all the goal-oriented and generation oracles, please check out the [TDC oracle webpage](https://zitniklab.hms.harvard.edu/TDC/functions/oracles/).\n",
"\n",
"We also provide three datasets for pair molecule generation: DRD2, QED and LogP. For example, to load the DRD2 dataset, you can type:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 3.14M/3.14M [00:00<00:00, 3.75MiB/s]\n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" X \\\n",
"0 Cn1c(CN2CCN(c3ncccn3)CC2)cc(=O)n(C)c1=O \n",
"1 C[C@@H](Sc1ncc(-c2ccccc2)n1C)C(=O)N[C@@H]1CCCc... \n",
"\n",
" Y \n",
"0 CC1(C)CC(=O)N(CCCCN2CCN(c3ncccn3)CC2)C(=O)C1 \n",
"1 CC(C(=O)NC1CCCc2ccccc21)N1CCN(c2ccc(F)cc2)CC1 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.generation import PairMolGen\n",
"data = PairMolGen(name = 'DRD2')\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The previous datasets assume a one-drug-fits-all-patients paradigm, whereas in reality different patients respond differently to the same drug, especially in oncology, where patient genomics is a deciding factor in a drug's effectiveness. This is known as precision oncology. In TDC, we include the Genomics of Drug Sensitivity in Cancer (GDSC) dataset, which measures drug response in various cancer cell lines. The dataset also includes the SMILES string of each drug and the gene expression of each cell line. There are two versions of GDSC; GDSC2 uses improved experimental procedures. To access the data, for example, type:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 117M/117M [00:06<00:00, 18.8MiB/s] \n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" Drug_ID Drug \\\n",
"0 Camptothecin CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=CC5=CC=CC=C5N=... \n",
"1 Camptothecin CC[C@@]1(C2=C(COC1=O)C(=O)N3CC4=CC5=CC=CC=C5N=... \n",
"\n",
" Cell Line_ID Cell Line Y \n",
"0 HCC1954 [8.54820830373167, 2.5996072676336297, 10.3759... -0.251083 \n",
"1 HCC1143 [7.58193774904993, 2.81430257671695, 10.363326... 1.343315 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.multi_pred import DrugRes\n",
"data = DrugRes(name = 'GDSC2')\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another important trend is drug combinations. Drug combinations can achieve synergistic effects and improve treatment outcomes. In the first version of TDC, we include one drug synergy dataset, OncoPolyPharmacology, which contains experimental results of drug-pair combination responses across various cancer cell lines. You can obtain it via:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 1.62G/1.62G [01:29<00:00, 18.1MiB/s] \n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" Drug1_ID Drug2_ID Cell_Line_ID Y \\\n",
"0 5-FU ABT-888 A2058 7.693530 \n",
"1 5-FU ABT-888 A2780 7.778053 \n",
"\n",
" Cell_Line Drug1 \\\n",
"0 [5.291146039856301, 5.040386719464342, 5.29114... O=c1[nH]cc(F)c(=O)[nH]1 \n",
"1 [5.291146039856301, 5.040386719464342, 5.29114... O=c1[nH]cc(F)c(=O)[nH]1 \n",
"\n",
" Drug2 \n",
"0 CC1(c2nc3c(C(N)=O)cccc3[nH]2)CCCN1 \n",
"1 CC1(c2nc3c(C(N)=O)cccc3[nH]2)CCCN1 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.multi_pred import DrugSyn\n",
"data = DrugSyn(name = 'OncoPolyPharmacology')\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Efficacy and Safety\n",
"\n",
"After a compound is found to have high affinity to the disease target, it also needs numerous drug-likeness properties in order to be delivered safely and efficaciously to the human body, that is, good ADME (Absorption, Distribution, Metabolism, and Excretion) properties. ADME datasets are scattered around the internet. There are several great web services for ADME prediction, but there is only a limited set of organized data for machine learning scientists to build models on and improve model performance. In the first release of TDC, we collect 21 ADME datasets from various public sources such as eDrug3D, AqSolDB, MoleculeNet, and the supplementary materials of various papers. You can find all the datasets by typing:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['lipophilicity_astrazeneca',\n",
" 'solubility_aqsoldb',\n",
" 'hydrationfreeenergy_freesolv',\n",
" 'caco2_wang',\n",
" 'hia_hou',\n",
" 'pgp_broccatelli',\n",
" 'f20_edrug3d',\n",
" 'f30_edrug3d',\n",
" 'bioavailability_ma',\n",
" 'vd_edrug3d',\n",
" 'cyp2c19_veith',\n",
" 'cyp2d6_veith',\n",
" 'cyp3a4_veith',\n",
" 'cyp1a2_veith',\n",
" 'cyp2c9_veith',\n",
" 'halflife_edrug3d',\n",
" 'clearance_edrug3d',\n",
" 'bbb_adenot',\n",
" 'bbb_martins',\n",
" 'ppbr_ma',\n",
" 'ppbr_edrug3d']"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc import utils\n",
"utils.retrieve_dataset_names('ADME')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As always, you can load and process the data through TDC data loaders. For example, to load the P-glycoprotein Inhibition dataset, type:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 129k/129k [00:00<00:00, 751kiB/s] \n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" Drug_ID \\\n",
"0 3,5,7-Trihydroxy-3',4',5'-trimethoxyflavone \n",
"1 3,6,3',4'-Tetramethoxyflavone \n",
"\n",
" Drug Y \n",
"0 Oc1cc(O)c2c(=O)c(O)c(oc2c1)c1cc(OC)c(OC)c(OC)c1 1 \n",
"1 COc1cc2c(oc(c(OC)c2=O)c2cc(OC)c(OC)cc2)cc1 1 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.single_pred import ADME\n",
"data = ADME(name = 'Pgp_Broccatelli')\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to good ADME properties, a drug has to have low toxicity. We put all toxicity datasets under the task `Tox`, where we collect Tox21, ToxCast, and ClinTox. Tox21 and ToxCast contain wet-lab results for various toxicity assays, so you can retrieve the outcome of any assay by specifying its name. You can list all the assay names and retrieve the corresponding data via:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['NR-AR', 'NR-AR-LBD', 'NR-AhR']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.utils import retrieve_label_name_list\n",
"label_list = retrieve_label_name_list('Tox21')\n",
"label_list[:3]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 712k/712k [00:00<00:00, 1.75MiB/s]\n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" Drug_ID Drug Y\n",
"0 TOX3021 CCOc1ccc2nc(S(N)(=O)=O)sc2c1 0.0\n",
"1 TOX3020 CCN1C(=O)NC(c2ccccc2)C1=O 0.0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.single_pred import Tox\n",
"data = Tox(name = 'Tox21', label_name = label_list[0])\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just as a molecule generation oracle can be used to optimize for high binding affinity to a target, generation can also be used for property improvement by simply switching the oracle. For example, to favor drugs that are easy to synthesize, we can use the Synthetic Accessibility (SA) oracle:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 9.05M/9.05M [00:00<00:00, 10.0MiB/s]\n"
]
},
{
"data": {
"text/plain": [
"[2.206330025677943, 1.580039750008826]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc import Oracle\n",
"oracle = Oracle(name = 'SA')\n",
"oracle(['CCOC1=CC(=C(C=C1C=CC(=O)O)Br)OCC', \n",
" 'CC(=O)OC1=CC=CC=C1C(=O)O'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to individual efficacy and safety, drugs can interact with each other and cause adverse effects, i.e. drug-drug interactions (DDIs). This becomes more and more important as more people take combinations of drugs for various diseases, and it is impossible to screen all combinations in the wet lab, especially higher-order ones. In TDC, we include the DrugBank and TWOSIDES datasets for DDI. For DrugBank, instead of the standard binary dataset, we use the full multi-typed DrugBank, which has more than 80 DDI types:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 44.4M/44.4M [00:02<00:00, 15.6MiB/s]\n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" Drug1_ID Drug1 Drug2_ID \\\n",
"0 DB04571 CC1=CC2=CC3=C(OC(=O)C=C3C)C(C)=C2O1 DB00460 \n",
"1 DB00855 NCC(=O)CCC(O)=O DB00460 \n",
"\n",
" Drug2 Y \n",
"0 COC(=O)CCC1=C2NC(\\C=C3/N=C(/C=C4\\N\\C(=C/C5=N/C... 1 \n",
"1 COC(=O)CCC1=C2NC(\\C=C3/N=C(/C=C4\\N\\C(=C/C5=N/C... 1 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.multi_pred import DDI\n",
"data = DDI(name = 'DrugBank')\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can look up what each label represents by typing:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#Drug1 may increase the photosensitizing activities of #Drug2.\n",
"#Drug1 may increase the anticholinergic activities of #Drug2.\n",
"The bioavailability of #Drug2 can be decreased when combined with #Drug1.\n"
]
}
],
"source": [
"from tdc.utils import get_label_map\n",
"label_map = get_label_map(name = 'DrugBank', task = 'DDI')\n",
"print(label_map[1])\n",
"print(label_map[2])\n",
"print(label_map[3])"
]
},
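{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each label sentence is a template with `#Drug1` and `#Drug2` placeholders. As a minimal sketch (not part of TDC), the snippet below fills those placeholders with the drug IDs of a row. The helper `describe_interaction` is hypothetical, and the small `label_map` dictionary is a toy stand-in that copies one sentence from the output above so the example runs without downloading anything.\n",
"\n",
"```python\n",
"# Toy stand-in for the mapping returned by get_label_map (assumption:\n",
"# only label 1 is included, with its sentence copied from the output above).\n",
"label_map = {\n",
"    1: '#Drug1 may increase the photosensitizing activities of #Drug2.',\n",
"}\n",
"\n",
"def describe_interaction(label, drug1, drug2, templates):\n",
"    # Hypothetical helper, not a TDC function: fill both placeholders.\n",
"    sentence = templates[label]\n",
"    return sentence.replace('#Drug1', drug1).replace('#Drug2', drug2)\n",
"\n",
"# Drug IDs and label taken from the first row of the DrugBank preview.\n",
"print(describe_interaction(1, 'DB04571', 'DB00460', label_map))\n",
"```\n",
"\n",
"With the real data, you would pass the mapping returned by `get_label_map` instead of the toy dictionary."
]
},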
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After a safe and efficacious compound is found, the lead compound usually goes to pre-clinical study and then clinical trials. TDC currently does not support any tasks in these stages, but we are actively working to include them (e.g. one task coming in a few months is clinical trial outcome prediction). **If you have any dataset related to this, please [contact us](mailto:kexinhuang@hsph.harvard.edu).**\n",
"\n",
"### Manufacturing\n",
"\n",
"After a potential drug candidate is discovered, a big portion of drug development is manufacturing, that is, how to make the drug candidate from basic reactants and catalysts. \n",
"\n",
"TDC currently includes four tasks in this stage. The first is reaction prediction, where one wants to predict the reaction outcome given the reactants. TDC parses out the full USPTO dataset and obtains 1,939,253 reactions. You can load the data via:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 795M/795M [00:44<00:00, 17.8MiB/s] \n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" input \\\n",
"0 [C:1]([C:5]1[CH:10]=[CH:9][C:8]([OH:11])=[CH:7... \n",
"1 [Cl-].[Al+3].[Cl-].[Cl-].[Cl:5][CH2:6][CH2:7][... \n",
"\n",
" output \n",
"0 [C:1]([CH:5]1[CH2:6][CH2:7][CH:8]([OH:11])[CH2... \n",
"1 [Cl:5][CH2:6][CH2:7][CH2:8][C:9]([C:15]1[CH:16... "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.generation import Reaction\n",
"data = Reaction(name = 'USPTO')\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to forward synthesis, a realistic scenario is that one has the product and wants to know which reactants can generate it. This is called retrosynthesis. By taking the same USPTO dataset above and flipping the input and output, we get a retrosynthesis dataset. A popular smaller dataset, widely used in the ML community, is USPTO-50K, a subset of USPTO. TDC also includes it:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 5.22M/5.22M [00:00<00:00, 5.57MiB/s]\n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" input output\n",
"0 COC(=O)CCC(=O)c1ccc(OC2CCCCO2)cc1O C1=COCCC1.COC(=O)CCC(=O)c1ccc(O)cc1O\n",
"1 COC(=O)c1cccc(-c2nc3cccnc3[nH]2)c1 COC(=O)c1cccc(C(=O)O)c1.Nc1cccnc1N"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.generation import RetroSyn\n",
"data = RetroSyn(name = 'USPTO-50K')\n",
"data.get_data().head(2)"
]
},
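{
"cell_type": "markdown",
"metadata": {},
"source": [
"The flipping step mentioned above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper `flip_reactions` is hypothetical and not part of TDC, and the records are toy dictionaries with `input` and `output` keys, using SMILES copied from the USPTO-50K preview.\n",
"\n",
"```python\n",
"# Toy forward-reaction record: reactants in 'input', product in 'output'.\n",
"forward = [\n",
"    {'input': 'C1=COCCC1.COC(=O)CCC(=O)c1ccc(O)cc1O',\n",
"     'output': 'COC(=O)CCC(=O)c1ccc(OC2CCCCO2)cc1O'},\n",
"]\n",
"\n",
"def flip_reactions(records):\n",
"    # Hypothetical helper, not a TDC function: swap the two fields so the\n",
"    # product becomes the model input and the reactants become the target.\n",
"    return [{'input': r['output'], 'output': r['input']} for r in records]\n",
"\n",
"retro = flip_reactions(forward)\n",
"print(retro[0]['input'])   # the product\n",
"print(retro[0]['output'])  # the reactants, separated by '.'\n",
"```\n",
"\n",
"Flipping twice returns the original forward records, so no information is lost in either direction."
]
},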
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to reaction predictions, it is also important to predict the reaction condition. One condition is the catalyst. Given the reactants and products, we want to predict the catalyst type. TDC again mines through the USPTO dataset and obtains 1,257,015 reactions with 888 common catalyst types."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 565M/565M [00:35<00:00, 16.1MiB/s] \n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" Reactant_ID Reactant Product_ID \\\n",
"0 reactant_1 [C:1]([C:5]1[CH:10]=[CH:9][C:8]([OH:11])=[CH:7... product_1 \n",
"1 reactant_2 [Cl-].[Al+3].[Cl-].[Cl-].[Cl:5][CH2:6][CH2:7][... product_2 \n",
"\n",
" Product Y \n",
"0 [C:1]([CH:5]1[CH2:6][CH2:7][CH:8]([OH:11])[CH2... 181 \n",
"1 [Cl:5][CH2:6][CH2:7][CH2:8][C:9]([C:15]1[CH:16... 2 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.multi_pred import Catalyst\n",
"data = Catalyst(name = 'USPTO_Catalyst')\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset is machine-learning ready, which means the labels are integer values. You can see what each label index corresponds to by:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"C1COCC1\n",
"C(Cl)Cl\n",
"CN(C=O)C\n"
]
}
],
"source": [
"from tdc.utils import get_label_map\n",
"label_map = get_label_map(name = 'USPTO_Catalyst', task = 'Catalyst')\n",
"print(label_map[1])\n",
"print(label_map[2])\n",
"print(label_map[3])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another important factor in drug manufacturing is yield. TDC includes two yield datasets. One is mined from USPTO. However, recent research from Schwaller et al. argues that USPTO is a bit too noisy, so we also include another dataset used in Schwaller et al., Buchwald-Hartwig. You can obtain it via:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading...\n",
"100%|██████████| 15.0M/15.0M [00:01<00:00, 11.7MiB/s]\n",
"Loading...\n",
"Done!\n"
]
},
{
"data": {
"text/plain": [
" Reaction_ID Reaction Y\n",
"0 reactions_1 {'reactant': 'FC(F)(F)c1ccc(Cl)cc1.Cc1ccc(N)cc... 0.106578\n",
"1 reactions_2 {'reactant': 'FC(F)(F)c1ccc(Br)cc1.Cc1ccc(N)cc... 0.147479"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tdc.single_pred import Yields\n",
"data = Yields(name = 'Buchwald-Hartwig')\n",
"data.get_data().head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's a long tutorial! We hope you are now familiar with what TDC covers for small molecule drugs. In the next tutorial, we will talk about machine learning datasets for biologics! Here are the links to the next few tutorials:\n",
"\n",
"* [TDC 103 Part 2: Datasets - Biologics](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_103_Datasets_Biologics.ipynb)\n",
"\n",
"* [TDC 104 ML Model Examples with DeepPurpose](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_104_ML_Model_DeepPurpose.ipynb)\n",
"\n",
"* [TDC 105 Molecular Oracles](https://github.com/mims-harvard/TDC/blob/master/tutorials/TDC_105_Oracles.ipynb)\n",
"\n",
"See you there!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:DeepPurpose]",
"language": "python",
"name": "conda-env-DeepPurpose-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}