This post is based on a discussion with Ashley Farley of the Bill and Melinda Gates Foundation Open Access Group; Andrew Leach of EMBL-EBI, the provider of ChEMBL; Evan Bolton of the NCBI/NLM at the NIH, the provider of PubChem; and CDD.
This is Part 1 of our two-part series on open-source data and its impact on drug discovery. Read Part 2, "What Does The Future Of Open Source Data Hold For Drug Discovery?", to learn more.
There is great interest in increasing the quantity and quality of shared drug discovery information.
More open access data aids researchers throughout the drug discovery process, prevents them from "reinventing the wheel" and helps with research focused on neglected diseases.
In 2006, CDD introduced a public section to the otherwise private CDD Vault.
The CDD platform now holds over 2.5 million compounds and associated data, and it is open to others who want to share and publish their own data. The Vault's public section can be reached through the public access area of the CDD website.
The structures in CDD public data are also visible in PubChem, with links back to the bioactivity data in CDD public. CDD continues to add to and expand the publicly available data within CDD Vault.
The value of large, open-source databases to society is huge and, as a scientific community, we are constantly improving the state of these databases.
The large, highly curated ChEMBL database and the massive PubChem repository for chemical and biological data are two more examples of public drug discovery databases that researchers can access.
It would be great to get to a point where all competitive drug discovery information can be made available so that anyone trying to discover new drugs, whether a citizen scientist or a large multinational drug discovery company, can access it.
As we move toward this end goal of more open and accessible information, it's important to optimize systems for both computers and humans to be able to access the data.
As published in the Journal of Cheminformatics, the most significant immediate beneficiaries of open data are chemical algorithms, which are capable of absorbing data and presenting concise insights to working chemists on a scale that could not be achieved by traditional publication methods.
But achieving the benefits of these chemical algorithms, which can synthesize and present data insights quickly, will require a paradigm shift in the way individual scientists translate their data into digital form.
Currently, most scientists enter their data in a way that is designed for presentation to humans rather than consumption by machine learning algorithms. To make this data consumable by algorithms, scientists have to annotate text and figures further, and the extra effort involved is off-putting.
One solution to this issue, published by CDD, is a hybrid system that combines machine learning based on natural language processing with a simplified user interface designed to help scientists curate their data with minimum effort.
Removing the barrier for scientists to record their data in a way that data algorithms can interpret is a first step toward creating a massive and open searchable database that can be accessed by scientists everywhere.
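To make the contrast concrete, here is a minimal sketch of the same result written once for a human reader and once as a structured record. The compound, values, field names, and the `potent_hits` helper are all invented for illustration; they are not real data and not CDD's schema.

```python
import json

# A hypothetical result as a scientist might write it for human readers.
free_text = "Compound X inhibited the target enzyme with an IC50 of 12 nM."

# The same made-up result as a structured record.
# Field names are illustrative assumptions, not an actual CDD format.
record = {
    "compound": "Compound X",
    "target": "target enzyme",
    "measurement": "IC50",
    "value": 12,
    "unit": "nM",
}


def potent_hits(records, max_nm):
    """Return records whose IC50, reported in nM, is at or below max_nm."""
    return [
        r for r in records
        if r["measurement"] == "IC50"
        and r["unit"] == "nM"
        and r["value"] <= max_nm
    ]


if __name__ == "__main__":
    print(free_text)
    print(json.dumps(record, indent=2))
    print(potent_hits([record], max_nm=100))
```

An algorithm can filter or aggregate the structured record directly, while the free-text sentence first has to be parsed, which is the step the annotation effort is meant to remove.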
As a society, it would be very beneficial to capture all that biocuration and make sure that we're constantly building on top of it, as opposed to reinventing the wheel or doing the same thing over and over again.
Beyond searching these large databases by biological target, it's also possible to run molecule queries. For example, in ChEMBL it's possible to search for a particular compound of interest or to perform substructure-based queries. Once the compounds of interest are identified, a researcher can retrieve additional bioactivity data or other information they want. What's key is that it's all integrated into this core resource.
Within PubChem, there are about 95 million small-molecule chemicals integrated into the system and almost 30 million chemicals that have some degree of annotation. When a researcher looks up a molecule, PubChem summarizes pretty much everything that's known about that chemical, because it's integrated with a number of other knowledge bases. PubChem has over 600 contributors of content. Its focus is on putting all the data in one spot so that, rather than having to navigate tens of different sites, researchers can look at the available information in one place. Plus, they can see exactly where that content came from and link back to the original website for more information.
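As a rough sketch of how such lookups can be scripted, the snippet below builds request URLs for a ChEMBL substructure search and a PubChem property lookup. The URL patterns follow the public ChEMBL web services and PubChem PUG REST conventions as we understand them, so check the current documentation before relying on them. The helper names are our own, and the code only constructs URLs; it does not send any requests.

```python
from urllib.parse import quote

# Base URLs assumed from the public ChEMBL and PubChem PUG REST services.
CHEMBL_BASE = "https://www.ebi.ac.uk/chembl/api/data"
PUBCHEM_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def chembl_substructure_url(smiles):
    """Build a URL for a ChEMBL substructure search on a SMILES string."""
    # safe="" percent-encodes reserved characters such as "(", ")" and "=".
    return f"{CHEMBL_BASE}/substructure/{quote(smiles, safe='')}?format=json"


def pubchem_property_url(name, properties):
    """Build a URL asking PubChem for selected properties of a named compound."""
    props = ",".join(properties)
    return f"{PUBCHEM_BASE}/compound/name/{quote(name, safe='')}/property/{props}/JSON"


if __name__ == "__main__":
    # Benzene ring as the query substructure.
    print(chembl_substructure_url("c1ccccc1"))
    # Formula and molecular weight for aspirin.
    print(pubchem_property_url("aspirin", ["MolecularFormula", "MolecularWeight"]))
```

The resulting URLs can be opened in a browser or fetched with any HTTP client; both services return JSON that links each record back to its source.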
How do we get to the point where we can use these high-powered tools and large databases to answer questions through open data? It's a lot of work to get there, and there are tough questions and current barriers that we have to work through to ensure future success in this area. Being able to innovate and accelerate drug discovery is one of our goals, so it's important that we continue to drive forward in making drug discovery data more widely available.
This blog is authored by members of the CDD Vault community. CDD Vault is a hosted drug discovery informatics platform that securely manages both private and external biological and chemical data. It provides core functionality including chemical registration, structure-activity relationship analysis, chemical inventory, and electronic lab notebook capabilities.