The Future of Open Source Data & Drug Discovery
Based on a discussion with Ashley Farley of the Bill and Melinda Gates Foundation's open access group; Andrew Leach of EMBL-EBI, the provider of ChEMBL; Evan Bolton of the NCBI/NLM at the NIH, the provider of PubChem; and CDD.
This is Part 2 of our two-part series on open-source data and its impact on drug discovery. Read Part 1, Why Is Publicly Available Data Important And How Is It Being Used?, to learn more.
The vastness of open-source data is a relatively new development in drug discovery.
Recent decades have seen the amount of data grow dramatically, transforming the way researchers seek out information. But this growth also presents new challenges for developing methods of collaboration.
Research published in the Journal of Computer-Aided Molecular Design points to a major future challenge: how databases and software will handle the ever-larger volumes of data accumulating from high-throughput screening, while still enabling users to draw insights, make predictions, and move projects forward.
With so much data available, a central issue is how that data is accessed and made available to the researchers who can benefit from it.
If you think back before PubChem existed, there were only a couple of different resources which were open and available.
There was the National Cancer Institute's Developmental Therapeutics Program, which had about a quarter of a million chemical structures.
Before 2004, this NCI database was all you really had available to you.
There was also the Maybridge collection, which had about 80,000 chemicals for sale.
Otherwise, you would just subscribe to the Available Chemical Directory, or some other type of paid service, to find out what chemicals were around and available for you to buy.
Now, fast forward.
Here we are.
There are about 100 million unique small molecules available.
You have very large projects - big data type projects - where you can purchase almost any molecule that is potentially synthesizable.
There are large virtual chemical libraries, where you just have to ask for a chemical and someone will make it for you. It's so different from the past and thus, presents unique challenges and potential benefits.
According to research published in the Journal of Cheminformatics, one of the most significant and immediate beneficiaries of open data will be chemical algorithms, which can absorb vast quantities of data and present concise insights to working chemists on a scale that could not be achieved through traditional publication methods.
However, making this goal practically achievable will require a paradigm shift in the way individual scientists translate their data into digital form, since most contemporary methods of data entry are designed for presentation to humans, rather than consumption by machine learning algorithms.
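To make the contrast concrete, here is a minimal sketch (not drawn from the discussion above) of what machine-oriented access to an open repository can look like: a script pulling bioactivity records for aspirin (CHEMBL25) from ChEMBL's public REST web services as structured JSON. The endpoint layout and field names reflect the documented API at the time of writing and may change.

```python
# A minimal sketch: pulling machine-readable bioactivity records for
# aspirin (CHEMBL25) from ChEMBL's public REST API. Endpoint and field
# names follow the documented web services and may change over time.
import json
import urllib.request

url = (
    "https://www.ebi.ac.uk/chembl/api/data/activity.json"
    "?molecule_chembl_id=CHEMBL25&limit=5"
)

with urllib.request.urlopen(url) as response:
    payload = json.load(response)

# Each record arrives as structured JSON rather than prose, so a script
# or a machine learning pipeline can consume it directly.
for record in payload.get("activities", []):
    print(
        record.get("target_chembl_id"),
        record.get("standard_type"),
        record.get("standard_value"),
        record.get("standard_units"),
    )
```

The point is not the particular fields retrieved, but that the data is already structured for machine consumption rather than formatted for a human reader.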
There is so much more available to scientists, and research can move much faster.
But, what does the future of this open-source data look like?
What is the long-term value to society of these public data repositories?
How can we make the future of open source data even better than it is today?
Here, we explore three areas of interest regarding the future of open-source data and what we need to do today to secure it.
-
What is needed to ensure the long-term value of open source data?
It's unknown how information will be accessible or useful in the future. So, how do we prepare to ensure the long-term value of open source data? Or, how do we get the most value out of the open source data that is available today?

First, there is data science. Data science needs huge amounts of information in order to work, so more and more information needs to be available for the repositories to gather and distribute. That information also needs to be high quality, which highlights the need for curation. Content can be curated based on the science, or curated as a function of time. For example, if you think of an experiment that was run back in the 1980s, would you trust that experiment or would you rerun it today? Science is constantly evolving, so having high-quality information available will help ensure the value of open source data well into the future.

The major value of open source data that still needs to be fully realized is the possibility of having everything that happened before available to you, so that it can fuel discoveries going forward at a faster rate than was ever possible before. It would be very, very helpful - but we need appropriate metadata. We need high-quality information for that process. We need biocuration to make that happen, and we need to pull it all together.
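As a toy illustration of what metadata-driven curation might look like in practice, the sketch below filters and flags assay records using a purely hypothetical record layout; the field names and the 1990 cutoff are illustrative assumptions, not any database's actual schema or policy.

```python
# A hypothetical curation pass: the record layout and field names here are
# illustrative only, not taken from any specific database schema.
from typing import Dict, List


def curate(records: List[Dict]) -> List[Dict]:
    """Keep records carrying the metadata a downstream model needs:
    a numeric value, explicit units, and a publication year we can weigh."""
    kept = []
    for rec in records:
        if rec.get("value") is None:
            continue                 # no measurement, nothing to learn from
        if not rec.get("units"):
            continue                 # unitless numbers are not comparable
        if rec.get("year", 0) < 1990:
            rec = {**rec, "flag": "re-test recommended"}  # old assay: keep, but flag
        kept.append(rec)
    return kept


assays = [
    {"compound": "A", "value": 12.0, "units": "nM", "year": 2019},
    {"compound": "B", "value": 8.5,  "units": "",   "year": 2021},
    {"compound": "C", "value": 40.0, "units": "nM", "year": 1984},
]
print(curate(assays))
```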
-
What are some upcoming changes that will improve data usage?
ChEMBL's key project for a number of months now has been a completely redesigned web interface that is better for users. Along with the interface redesign, there are a number of behind-the-scenes changes to the way the data is set up. Looking more broadly, there are questions about the different types of data that could be incorporated into ChEMBL. For example, recent exploration is looking at how bioactivity data from patents might be extracted and added to the ChEMBL database. Additionally, as experimental platforms change the types of data and the scale at which data is generated, the way data is stored and found will change as well. It's possible that all this will lead to new ways in which these data feed into databases, including applications of AI and machine learning.

If you think about the dynamic landscape of what's happening with data science, around web technologies and beyond, it's a very interesting time, and the next 3-5 years will be even more interesting.

At PubChem, there are often tens of thousands, or even hundreds of thousands, of literature links to a single chemical. Figuring out how this massive amount of information can best be summarized is one focus of future changes at PubChem. One near-term change being implemented to address this is a view called co-occurrence, where you can find other chemicals that are often mentioned alongside a given chemical. It will also be possible to view diseases related to a chemical (whether it treats or causes them), to give you a sense of the types of disease commonly associated with that chemical. A similar co-occurrence view will also be available for genes and proteins.

The thought here is that a researcher could ask questions relative to a given disease and find out what PubChem knows about it. That person could then investigate the bioactivities relevant to the disease, the other genes and targets associated with it, the chemicals that may treat or cause it, and all the articles that back this information up. The idea is to stitch together the fabric and ecosystem of all available data and where it originates. As you start to think about the ecosystem that chemists, biologists, drug discovery scientists, pharmacologists, toxicologists, and environmental scientists all care about, the next step is to wrap it all together in a package that they can access and download.

In the future, there will be a shift away from analysis-type tools and toward data views and pre-computed information that match what users are trying to find out, because there is simply too much content for a human to work through alone. Data science approaches can make things a little more obvious to the interactive user.

The future looks positive. As more information and more metadata become available, everybody wins. These and other changes will allow researchers to access that content and do more with it. As long as researchers can find what they need, we all win, because we make more discoveries faster and better.
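For readers curious about what programmatic access to PubChem looks like today, here is a minimal sketch using the public PUG REST interface to look up a compound by name and retrieve a couple of pre-computed properties. The URL pattern and response layout follow the public documentation at the time of writing; the co-occurrence and disease views described above are separate, browser-oriented features and are not shown here.

```python
# A minimal sketch of programmatic access to PubChem via its PUG REST
# interface; the URL pattern and response keys follow the public
# documentation at the time of writing and may evolve.
import json
import urllib.request

name = "aspirin"
url = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
    f"{name}/property/MolecularFormula,MolecularWeight/JSON"
)

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# The response nests results under PropertyTable -> Properties.
for prop in data["PropertyTable"]["Properties"]:
    print(prop["CID"], prop["MolecularFormula"], prop["MolecularWeight"])
```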
-
How will open-source data make the future of drug discovery better than it is today?
One development that would really make a key difference is a much more seamless way for data to be made visible and accessible through open resources. Creating these resources is a time-consuming process, and it would take effort across a number of areas, but it's not an insurmountable problem. It's as much a cultural problem as a technical one, so it's certainly achievable.

Another fantastic development would be a program that could tell you, for every new article that comes out, which information is new and which is old. This would allow you to determine whether you are supporting earlier findings or are somehow in conflict with them. It would give researchers a better sense of the state of the science and show where scientists agree and where they disagree. The next step could then be a computer that starts to suggest the types of experiments needed to resolve certain conflicts in the available data. Imagine if an AI program could say, "Hey, someone needs to run this experiment, because of this gap in the data." Filling in these knowledge gaps would make it possible to plan ahead and run more sensible experiments that bring about progress at a much faster rate.
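To make the conflict-detection idea more tangible, here is a toy sketch that groups reported potency values by compound-target pair and flags pairs where the literature disagrees by more than a log unit. The measurements and the threshold are invented for illustration; they are not from any real dataset or tool.

```python
# A toy illustration of the "conflict detection" idea: the measurements
# below are invented, and the one-log-unit threshold is an arbitrary choice.
from collections import defaultdict

reports = [
    ("cmpd-1", "kinase-X", 6.2),   # hypothetical reported pIC50 values
    ("cmpd-1", "kinase-X", 8.1),
    ("cmpd-2", "kinase-X", 7.0),
    ("cmpd-2", "kinase-X", 7.1),
]

# Group every reported value by (compound, target) pair.
by_pair = defaultdict(list)
for compound, target, pic50 in reports:
    by_pair[(compound, target)].append(pic50)

# Flag pairs where reported values span more than one log unit.
for (compound, target), values in by_pair.items():
    spread = max(values) - min(values)
    if spread > 1.0:
        print(f"Conflict: {compound} vs {target}, pIC50 spread = {spread:.1f}; "
              "a tie-breaking experiment may be needed")
```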
-
Summary
As technology advances and scientists produce more and more data at faster and faster rates, the accessibility of that data becomes paramount. Answering the questions of what is needed to sustain open-source data and how to improve its usability will be key.
This blog is authored by members of the CDD Vault community. CDD Vault is a hosted drug discovery informatics platform that securely manages both private and external biological and chemical data. It provides core functionality including chemical registration, structure-activity relationship analysis, chemical inventory, and electronic lab notebook capabilities.