With the recent announcement of our integration of the Enamine Compound Collection into the AI module of CDD Vault to enhance AI-powered similarity searches, the time seems right to take a deeper look into our AI module, and the thoughts that led to its development and continuing evolution.
The AI module in CDD Vault supports researchers in managing and interpreting complex datasets, predicting compound behavior, and identifying potential leads with greater precision. It is directly integrated with the Visualization module for multiparameter optimization. We do this by harnessing neural networks, specifically Graph Convolutional Networks, an area of deep learning that processes data structured as graphs. Graph Convolutional Networks bring value to drug discovery by using an atom’s features, as well as its chemical environment, to make predictions and enable exploration of the latent vector space.
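To make the idea of "an atom’s features, as well as its chemical environment" concrete, here is a minimal sketch of a single graph convolution step on a toy molecule. It is an illustration only, not the architecture used in CDD Vault: the two-element atom featurization, the plain neighbor averaging, and the sum pooling are all simplifying assumptions, and a trained network would add learned weights and nonlinearities.

```python
# Toy molecule: ethanol heavy atoms, C0-C1-O2 (hydrogens omitted).
# Assumed, minimal featurization per atom: [is_carbon, is_oxygen].
atom_features = [
    [1.0, 0.0],  # C0
    [1.0, 0.0],  # C1
    [0.0, 1.0],  # O2
]
bonds = [(0, 1), (1, 2)]


def neighbor_lists(n_atoms, bonds):
    """Build an adjacency list that includes each atom itself (self-loop)."""
    neighbors = [[i] for i in range(n_atoms)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors


def graph_conv_step(features, neighbors):
    """One message-passing step: average each atom's features with its neighbors'.

    A trained network would also apply a learned weight matrix and a
    nonlinearity here; both are left out to keep the sketch short.
    """
    n_dims = len(features[0])
    return [
        [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(n_dims)]
        for nbrs in neighbors
    ]


def molecule_vector(features):
    """Sum-pool the atom vectors into one fixed-length vector for the molecule."""
    return [sum(column) for column in zip(*features)]


neighbors = neighbor_lists(len(atom_features), bonds)
hidden = graph_conv_step(atom_features, neighbors)
embedding = molecule_vector(hidden)

for index, vector in enumerate(hidden):
    print(index, vector)
print("molecule vector:", embedding)
```

The two carbons start with identical features, but after one step they differ because C1 is bonded to the oxygen and C0 is not. That is the sense in which the representation captures the chemical environment, and the pooled vector is a simple stand-in for a point in latent space.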
CDD began work on our AI module to take advantage of the ability of Graph Convolutional Networks to create numerical representations of structures in the so-called latent space. Initially, we looked at how well such an approach represented actual chemistry, that is, whether similar structures were actually close together in that space. Indeed, they are. From there, the logical next step was to ask: can we use it as a similarity measure, and for similarity search? The answer to both was yes, with extremely fast and efficient searches enabled through the use of vector databases.
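As a rough illustration of what similarity search in latent space means, the sketch below ranks a small library of vectors by cosine similarity to a query vector. The compound names and three-dimensional vectors are made up for the example, and cosine similarity is an assumed choice of measure. The brute-force scan also stands in for what a vector database would do with an index over millions of much longer vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(query, library, k=2):
    """Return the k library entries most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in library.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Hypothetical latent vectors; real ones would come from the trained network.
library = {
    "cmpd_A": [0.9, 0.1, 0.0],
    "cmpd_B": [0.8, 0.2, 0.1],
    "cmpd_C": [0.0, 0.1, 0.9],
}
query = [1.0, 0.1, 0.0]

hits = top_k_similar(query, library, k=2)
for name, score in hits:
    print(f"{name}: {score:.3f}")
```

Because every structure is reduced to a vector of numbers, the search becomes a nearest-neighbor problem, which is exactly the workload vector databases are designed to handle quickly.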
The Power of Numerical Representation
Our AI module API enables you to use a numerical representation of a compound to build models. Once you have your model, you can explore questions like: If I move from one point to another in my latent vector space, what will be the impact on my activity? What structures would correspond to those points? You can create chemical structures around a point and search for improvements by moving in different directions within the latent vector space.
Currently, our deep learning similarity search covers about 33 million structures from ChEMBL, SureChEMBL, and Enamine. Searching these databases allows you to answer questions like: What is known about my hit? What is the intellectual property space around the structure? Can I purchase similar compounds to expand my SAR knowledge?
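The question of moving from one point to another can be sketched as a straight-line walk between two latent vectors, scoring each intermediate point with a model. Everything here is hypothetical: the two-dimensional vectors, the linear model and its weights, and the function names are invented for illustration and are not part of the CDD Vault API. Turning an intermediate point back into a chemical structure would need a decoder, which is not shown.

```python
def interpolate(start, end, steps):
    """Return steps + 1 evenly spaced points on the line from start to end."""
    return [
        [s + (e - s) * i / steps for s, e in zip(start, end)]
        for i in range(steps + 1)
    ]


def predict_activity(vector, weights, bias):
    """Hypothetical linear activity model: weighted sum of latent coordinates."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


# Hypothetical latent vectors for two compounds and made-up model parameters.
start = [0.0, 1.0]
end = [1.0, 0.0]
weights = [2.0, -1.0]
bias = 5.0

path = interpolate(start, end, steps=4)
profile = [predict_activity(point, weights, bias) for point in path]

for point, activity in zip(path, profile):
    print(point, activity)
```

With a real model, the predicted activity along the path would generally not change linearly, and that is what makes the walk informative: it shows in which direction the model expects activity to improve.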
We are currently working on ways to incorporate this generative AI component of our module into the process to enable even faster exploration and evaluation within the vector space. With generative AI, you can let the AI module dream up new structures to test and validate. As these new structures are often structurally quite distinct, this can be useful to move away from an existing compound series and identify alternatives.
Adding Bioisosteric Suggestions and Additional Databases
As a more suitable application for SAR exploration, we added bioisosteric suggestions to our AI module. Here, we use the latent space representation of parts of a structure to identify possible replacements with similar physical and chemical properties. These have the potential to produce similar biological effects. The suggested ideas are augmented with searches in the patent and commercial compound databases. This means that when you evaluate hit lists generated by our AI module and a molecule of interest is already covered by a patent, you can look for groups of molecules that have similar physical and chemical properties, and produce similar biological effects, beyond what is covered by existing patents. Similarly, you can use the AI module to see if you can incorporate compounds already available commercially.
Enabling fast searches is always a benefit in drug discovery. Protection of intellectual property is even more important, which is why our AI module provides both. IP protection comes from the ability to run searches, and to use our generative AI, within the secure confines of your CDD Vault, without having to export data to a third-party application.
= =
Dr. Peter Gedeck
Research Informatics Scientist