BioHDF - Open Binary File Standards for Bioinformatics

Period of Performance: 03/17/2009 - 02/28/2010


Phase 2 STTR

Recipient Firm

Geospiza, Inc.
Seattle, WA 98107
Principal Investigator


DESCRIPTION (provided by applicant): The first wave of Next Generation ("Next Gen") sequencing technologies combines molecular resolution with extremely high throughput to dramatically reduce sequencing costs and increase assay sensitivity and specificity. These technologies will provide large numbers of laboratories with "Genome Center" levels of throughput to make discoveries and develop new assays never before imagined. However, widespread adoption of Next Gen will be hindered because current bioinformatics programs do not scale; they are inefficient in data storage, processing, and memory utilization. The most popular programs typically copy and recopy data to new files many times during processing, require that all data be maintained in random access memory (RAM) when running, and cannot incrementally process data. To overcome these issues, fundamental changes in data management and processing are needed. Geospiza and The HDF Group are collaborating to develop portable, scalable bioinformatics technologies based on HDF5 (Hierarchical Data Format). We call these extensible domain-specific data technologies "BioHDF." BioHDF will implement a data model that supports primary DNA sequence information (reads, quality values, and metadata) and results from sequence assembly and variation detection algorithms. BioHDF will extend HDF5 data structures and library routines with new features (indexes, additional compression, and graph layouts) to support the high-performance data storage and computation requirements of Next Gen sequencing. BioHDF will include APIs, software tools, and a viewer based on HDFView to enable its use in the bioinformatics and research communities.
Using BioHDF, researchers will be able to perform whole genome shotgun sequencing (WGS), "tag and count" experiments (EST analysis, promoter mapping, DNA methylation, functional mapping), and variation analysis; they will also be able to export datasets in formats accepted by the key databases to publish their work. As a programming environment, BioHDF can be easily extended to accept data from new data collection platforms and to format data for interchange with many databases. Core BioHDF tools will be delivered to the research community as an open source technology. Geospiza will use BioHDF in its Finch line of products to deliver software systems and applications to support clinical research, diagnostics, and other relevant activities that rely on genetic data.

PUBLIC HEALTH RELEVANCE: The overall goal of the BioHDF Phase II project is to make it possible for medical research and clinical communities to take full advantage of the latest DNA sequencing platforms in their efforts to improve public health. Geospiza and The HDF Group will build on their expertise in Laboratory Information Management Systems and high-volume, high-complexity scientific data management systems to create and deliver bioinformatics software systems that can handle the massive amounts of data produced by the latest sequencing instruments. The integrated systems will keep track of collected samples, sequence data, DNA tests, and other laboratory records and biological data associated with the entire sequencing and analysis process, and make it easy for clinicians to use the technology to do their work.