Dynamic Mining and Contextualization of the Scientific Literature. This project creates interactive scientific articles and collects usage metrics on embedded database links, accelerating scientific discovery and improving reproducibility.

Period of Performance: 04/01/2017 - 03/31/2018

$212K

Phase 1 SBIR

Recipient Firm

Insilico, Inc.
EUGENE, OR 97405
Principal Investigator

Abstract

The proposed Dynamic Mining and Contextualization of the Scientific Literature (DMCSL) will provide an open channel of communication among authors, scientific journals, readers, and databases. The outcome of this communication portal will be a database of mineable metadata for researchers, reagent suppliers, and biotech companies. Data will be available to companies through individualized subscription models. The pipeline identifies biological entities (e.g., genes and alleles) and embeds hyperlinks from these entities to NHGRI-funded, curated Model Organism Databases (MODs). DMCSL enhances a markup pipeline that has been in operation since 2009 and has linked biological entities in over 850 research articles in GENETICS and G3, published by the Genetics Society of America (GSA), to pages in the MODs WormBase, FlyBase, and the Saccharomyces Genome Database. This proposal seeks funding to expand the scope of the GSA markup pipeline in all respects: the biological entities linked; the authoritative databases linked to (the Rat Genome Database, Mouse Genome Informatics, the Zebrafish Model Organism Database, and the fission yeast genome database); and the journals linked from. The expansion will also include collecting information on supplies and equipment described in the Materials and Methods sections of articles, along with supplier information. DMCSL will collect and store link information together with author and journal metadata and link access statistics. In doing so, it will provide valuable metrics to all stakeholders, including biotech companies and life science vendors, as well as a comprehensive, queryable view of biology not currently available. In Phase I, we will develop code flexible enough to scale the pipeline, linking a single article against more lexica and more databases within the strict turnaround time set by the publisher's production process. We will also test the software by linking publications from other journals and develop tools to query and mine relationships identified through the data extraction process. We will develop basic APIs: a core database API resource; a linking API to store created links and monitor link activity; and, using modern API management, a portal providing key-based access to other API data. Having demonstrated the stability and flexibility of the software under current parameters, in Phase II we will collaborate with a wider range of stakeholders, including more journals, more databases (expanding to human biomedical databases), and more companies, to develop experience-based APIs for each stakeholder group. These APIs will be designed intuitively, based on how each group interacts with the basic API developed in Phase I, and will underpin subscription-based access for commercial companies; access for academic stakeholders and collaborating journals will remain free.
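
To make the markup and linking steps concrete, the sketch below shows one minimal way a lexicon-based pipeline of this kind could match known entity symbols in article text, wrap them in hyperlinks to MOD pages, and record each created link and its accesses for later metrics. This is an illustrative sketch only, assuming a simple dictionary lexicon; the entity symbols, example.org URLs, and names such as LEXICON, LinkStore, and mark_up are hypothetical and are not taken from the existing GSA pipeline or its APIs.

    import re
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    # Hypothetical lexicon: entity symbol -> (database name, target URL).
    # A real lexicon would be exported from the MODs (WormBase, FlyBase, SGD, ...).
    LEXICON = {
        "unc-22": ("WormBase", "https://example.org/wormbase/gene/unc-22"),
        "ACT1":   ("SGD",      "https://example.org/sgd/locus/ACT1"),
    }

    @dataclass
    class LinkStore:
        """Minimal stand-in for the proposed linking API: stores created
        links and counts accesses so usage metrics can be reported later."""
        links: list = field(default_factory=list)
        access_counts: dict = field(default_factory=dict)

        def record_link(self, article_id, symbol, database, url):
            self.links.append({
                "article": article_id,
                "symbol": symbol,
                "database": database,
                "url": url,
                "created": datetime.now(timezone.utc).isoformat(),
            })

        def record_access(self, url):
            self.access_counts[url] = self.access_counts.get(url, 0) + 1

    def mark_up(article_id, text, store):
        """Replace known entity symbols with HTML links and log each link."""
        pattern = re.compile(r"\b(" + "|".join(map(re.escape, LEXICON)) + r")\b")

        def _link(match):
            symbol = match.group(1)
            database, url = LEXICON[symbol]
            store.record_link(article_id, symbol, database, url)
            return f'<a href="{url}">{symbol}</a>'

        return pattern.sub(_link, text)

    if __name__ == "__main__":
        store = LinkStore()
        html = mark_up("G3-2017-0001",
                       "Mutations in unc-22 and ACT1 were analyzed.", store)
        store.record_access(store.links[0]["url"])  # simulate one reader click
        print(html)
        print(store.links)
        print(store.access_counts)

In a production pipeline the lexicon lookup, link storage, and access counting would sit behind the core and linking APIs described above, so that journals, databases, and subscribing companies query the same stored link records rather than a local dictionary.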