Biocomputation across distributed private datasets to enhance drug discovery

Period of Performance: 04/01/2017 - 03/31/2018


Phase 2 SBIR

Recipient Firm

Collaborative Drug Discovery, Inc.
Burlingame, CA 94010
Principal Investigator


Project Summary

Collaborative Drug Discovery (CDD) proposes to develop technology that will vastly simplify and integrate all the processes required to exploit predictive models for drug discovery. The software will make it easy for scientists without specialized training in informatics to create, train, apply, evaluate, share, and archive models with minimal effort, and also leverage a large library of pre-computed models with zero effort. The software will also enable scientists working in different organizations to collectively build models from their aggregated data and share these models, without sharing the underlying training data. Our goal is to democratize the role in drug discovery of computational models, which have historically been restricted to computational experts, and allow models to become routine aids to the discovery workflow in academia, foundations, government laboratories, and small companies that do not have the resources to employ them today.

In Phase 2 we implemented modified Bayesian model building directly within CDD's web-based CDD Vault platform, which securely hosts structure-activity relationship (SAR) data; any user can now easily train a Bayesian model with experimental data stored in her private Vault, then apply the model to predict activity for untested compounds. In Phase 2B we propose to generalize this capability with the following new Specific Aims, which are needed to achieve a widespread scientific and commercial impact:

Aim 1: Integrate a suite of diverse computational techniques (such as QSAR, Neural Networks, Support Vector Machines, Random Forest, k-Nearest Neighbors, and possibly others) into a single framework, to allow direct side-by-side comparison.
Aim 2: Develop and validate a universal metric that ranks the predictive strength of each method as applied to a particular dataset.
Aim 3: Apply the metric to automatically generate thousands of models from high-quality, public-access structure-activity and ADME/Tox datasets and present key results to the user.
Aim 4: Develop a novel capability to build models collaboratively, by aggregating multiple datasets, and share the models without revealing the compounds and data in the training sets.
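
To illustrate the idea behind Aim 4, the following is a minimal sketch of how a naive Bayes style model could be built from aggregated per-feature counts rather than from the compounds themselves. It is not CDD's implementation. The summary does not say which "modified Bayesian" variant is used, so the Laplacian-corrected estimator here is an assumption, as are all function names (`site_counts`, `merge_counts`, `feature_weights`, `score`) and the representation of a compound as a set of structural feature identifiers.

```python
import math
from collections import Counter


def site_counts(records):
    """Summarise one site's private data as counts only (hypothetical helper).

    records: iterable of (features, is_active) pairs, where features is a
    collection of hashable structural feature identifiers.
    """
    total, active = Counter(), Counter()
    n = n_active = 0
    for features, is_active in records:
        n += 1
        n_active += int(is_active)
        for f in set(features):
            total[f] += 1
            if is_active:
                active[f] += 1
    return {"n": n, "n_active": n_active, "total": total, "active": active}


def merge_counts(summaries):
    """Add up count summaries from several sites; no compounds are exchanged."""
    merged = {"n": 0, "n_active": 0, "total": Counter(), "active": Counter()}
    for s in summaries:
        merged["n"] += s["n"]
        merged["n_active"] += s["n_active"]
        merged["total"].update(s["total"])
        merged["active"].update(s["active"])
    return merged


def feature_weights(summary):
    """Laplacian-corrected weight per feature (assumed variant):
    log((A_f + 1) / (T_f * base_rate + 1))."""
    base_rate = summary["n_active"] / summary["n"]
    return {
        f: math.log((summary["active"][f] + 1) / (t * base_rate + 1))
        for f, t in summary["total"].items()
    }


def score(weights, features):
    """Sum the weights of the features present; unseen features contribute 0."""
    return sum(weights.get(f, 0.0) for f in set(features))
```

Because the weights depend only on summed counts, the model built from merged summaries is identical to one trained on the pooled data. Sharing counts is not by itself a privacy guarantee, since rare features can still hint at the underlying structures; the sketch only shows why aggregation without exchanging training records is feasible for this class of model.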