Indexing large scientific data

Period of Performance: 01/01/2014 - 12/31/2014

$1.5MM

Phase 2 SBIR

Recipient Firm

Map Large, Inc.
PO BOX 8482
Atlanta, GA 31106

Principal Investigator

Abstract

Hadoop-style systems have done an excellent job of providing scalable, long-term, disk-bound data storage and enjoy wide acceptance in both government and the private sector. However, Hadoop implementations suffer from performance limitations in whole-set aggregates and real-time interactivity, which we believe can be solved by optimizing for local memory operations. The key performance driver is memory locality. A well-written Hadoop process might sometimes achieve optimal memory throughput on an individual node, but the system as a whole generally does not achieve optimal memory locality and thus frequently fails to meet performance requirements. We propose to create a multi-node data architecture that automatically optimizes for memory locality using a compressed, column-oriented layout compatible with both CPU and GPU processing. The result will be a real-time streaming architecture capable of indexing and querying large volumes of heterogeneous scientific data stored on clusters of cloud computers. The resulting system will be highly useful in the U.S. In-Memory Analytics and Data Discovery market. Gartner projects this market to reach $1 billion by 2013. This market is currently growing about 30% per year.
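
To illustrate why a compressed, column-oriented layout helps with whole-set aggregates, here is a minimal sketch in Python. The abstract does not name a compression scheme, so run-length encoding is assumed purely for illustration, and the sensor data and function names are hypothetical. The point is that an aggregate over one field touches only that field's contiguous, compressed column rather than every row.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a column into (value, run_length) pairs.

    Run-length encoding is an assumed scheme; the source only says
    "compressed".
    """
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_sum(runs):
    """Sum a column directly from its compressed runs, without decompressing."""
    return sum(value * length for value, length in runs)


def rle_count_where(runs, predicate):
    """Count entries matching a predicate, evaluating it once per run."""
    return sum(length for value, length in runs if predicate(value))


# Hypothetical row-oriented records: (sensor_id, temperature)
rows = [("s1", 20), ("s1", 20), ("s1", 21), ("s2", 21), ("s2", 21), ("s2", 19)]

# Column-oriented layout: each field is stored contiguously.
sensor_col = [row[0] for row in rows]
temp_col = [row[1] for row in rows]

# Compress the temperature column and aggregate over the runs.
temp_runs = rle_encode(temp_col)
total = rle_sum(temp_runs)
warm = rle_count_where(temp_runs, lambda t: t >= 21)

print(temp_runs)
print(total, warm)
```

In this sketch the aggregate does work proportional to the number of runs rather than the number of rows, which is one way a compressed column can reduce memory traffic. A production system would use typed arrays and vectorized or GPU kernels rather than Python lists.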