iNFORMER: A MapReduce-Like Data-Intensive Processing Framework For Native Data Storage And Formats

Period of Performance: 01/01/2015 - 12/31/2015

$604K

Phase 2 STTR

Recipient Firm

RNET Technologies, Inc.
240 West Elmwood Dr Suite 2010
Dayton, OH 45459
Firm POC
Principal Investigator

Research Institution

The Ohio State University
1330 Kinnear Road
Columbus, OH 43212
Institution POC

Abstract

The majority of Big Data applications have been built around the MapReduce paradigm. Despite the popularity of MapReduce, several obstacles stand in the way of applying it to some commercial and scientific use cases. These include the requirement to load data into specialized file systems such as HDFS, as well as long intermediate data shuffle and sorting phases involving storage, which impose significant performance penalties and memory stress. The project will develop iNFORMER, a Native data FOrmat MapREDuce-like framework based on OSU's Sci-Mate project. The framework allows MapReduce-like applications to be executed over data stored in native data formats and filesystems, without loading the data into another file system such as HDFS. The product will have a low-overhead MATE processing engine with in-situ/in-memory processing capability, and a Virtual Data Integrator (VDI) interface for data access and integration. These components can also be integrated into existing popular Big Data platforms such as HPCC Systems or Hadoop to allow direct access to data in native formats. In Phase I, a study was conducted to extend MATE with in-situ data analysis, with results showing superior performance to Apache Spark. An existing version of MATE was also integrated into YARN to show the feasibility of Big Data platform integration. On the data integration side, a parallel data loading module was developed to load compressed XML data directly from Amazon S3 into the HPCC Systems Thor processing engine. In Phase II, the MATE prototype will be completed as a processing engine with support for batch and in-situ/in-memory data analysis, and compared to the state of the art, such as Hadoop and Spark. MATE will also be integrated into Big Data platforms. In Phase II, VDI will also be completed, including the API and storage adapters, to give MATE access to native data formats. VDI will also be integrated into HPCC Systems and Hadoop.
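The idea of running a MapReduce-like job directly over natively stored data, with no load step and no intermediate shuffle, can be illustrated with a minimal Python sketch. It is not the MATE or VDI API: the function names are hypothetical, a small CSV stream stands in for a native format such as NetCDF or HDF5, and the "reduction object" is assumed to be a plain dictionary that each record is folded into as soon as it is read.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(handle, chunk_rows):
    """Yield lists of rows read directly from a file in its native CSV layout."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return
        yield chunk


def reduce_chunk(chunk):
    """Fold each record straight into a reduction object: station -> [sum, count].

    No intermediate (key, value) pairs are written out or sorted.
    """
    acc = defaultdict(lambda: [0.0, 0])
    for row in chunk:
        slot = acc[row["station"]]
        slot[0] += float(row["temp"])
        slot[1] += 1
    return acc


def merge(total, partial):
    """Combine two reduction objects; in this sketch it takes the place of a shuffle."""
    for key, (partial_sum, partial_count) in partial.items():
        total[key][0] += partial_sum
        total[key][1] += partial_count
    return total


def mean_by_station(handle, chunk_rows=2):
    """Compute the mean temperature per station, one chunk at a time."""
    total = defaultdict(lambda: [0.0, 0])
    for chunk in iter_chunks(handle, chunk_rows):
        merge(total, reduce_chunk(chunk))
    return {key: s / n for key, (s, n) in total.items()}


# Hypothetical sample data; a real job would open the file where it already lives.
SAMPLE = "station,temp\nA,10\nB,20\nA,30\nB,60\nA,50\n"
result = mean_by_station(io.StringIO(SAMPLE))
print(result)
```

In a parallel setting, each worker would presumably build its own reduction object over the chunks it reads, and only those small objects would be merged, which is why memory and storage pressure can stay low compared with a full shuffle.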
Commercial Applications and Other Benefits: iNFORMER tools will directly benefit scientific and government communities that store data in special data formats on parallel (and usually detached/network) filesystems. In particular, formats such as NetCDF and HDF5 are used by a wide range of users, including groups from academia, industry, and national laboratories. A further benefit, particularly for scientific communities, is that MATE provides a lightweight, low-overhead, high-performance data analysis engine, so data does not have to be moved into complex environments such as Hadoop. Commercial systems such as HPCC Systems or Hadoop can also benefit from the iNFORMER VDI to ingest data in various formats from non-platform-specific filesystems, whether in relational databases, weblogs, Cloud storage systems, or network parallel file systems. This allows them to reach far beyond their currently addressed market while improving their current customers' experience.