GoBig: A Unified Interface to Big Data Systems

Period of Performance: 02/17/2015 - 11/16/2015


Phase 1 SBIR

Recipient Firm

28 Corporate Drive
Clifton Park, NY 12065
Firm POC
Principal Investigator


Problem statement

A researcher dealing with big data today is met with a maze of languages, programming environments, data storage and query systems, and compute engines. Pursuing a new path in this space may take years and millions of dollars of investment, only for the researcher to discover that a new and more applicable big data paradigm has emerged. Costs include learning programming languages, storage systems, and computing paradigms, as well as the significant hardware and administrative costs of setting up and maintaining the environments needed for data storage, transfer, and computation.

How this problem is being addressed

GoBig unifies and simplifies big data tools in two important areas: a unified user interface to big data software and hardware stacks, and streamlined, modular deployment to various types of cloud and HPC systems. Data is managed through the extensible Girder data framework, an open-source project started at Kitware that provides a unified interface to many distributed storage systems along with access control and extensible plugins. Romanesco manages analyses and workflows that span programming language boundaries. The results are then persisted in Girder to be made available for further analysis or visualization. Rather than managing and supporting separate user endpoints for each big data toolchain, user management and authorization for multiple systems can be handled through GoBig's account credentials.

What is to be done in Phase I

To demonstrate the feasibility of the GoBig system in Phase I, we will show system modularity by extending computation support in GoBig to Hadoop, HPC clusters running MPI, a queueing system, and a distributed data system. We will also add Julia, Java, and Scala to the analytic programming languages supported in GoBig, and demonstrate the applicability of GoBig to a computational science domain.
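As an illustration only, the kind of modularity described above can be pictured as a small registry of interchangeable compute backends behind one common call. The sketch below is a minimal Python example under stated assumptions: the names ComputeBackend, LocalBackend, ThreadPoolBackend, register_backend, and get_backend are hypothetical and are not taken from GoBig, Girder, or Romanesco, and a local thread pool stands in for a real cluster system such as Hadoop or MPI.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


class ComputeBackend(ABC):
    """Hypothetical common interface; not an actual GoBig or Romanesco class."""

    @abstractmethod
    def run(self, task: Callable[[int], int], inputs: Iterable[int]) -> List[int]:
        """Apply task to every input and return the results in input order."""


class LocalBackend(ComputeBackend):
    """Runs every task serially in the calling process."""

    def run(self, task, inputs):
        return [task(x) for x in inputs]


class ThreadPoolBackend(ComputeBackend):
    """Stand-in for a distributed engine: fans tasks out to worker threads."""

    def __init__(self, workers: int = 4):
        self.workers = workers

    def run(self, task, inputs):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # Executor.map preserves the order of the inputs.
            return list(pool.map(task, inputs))


# Hypothetical plugin registry: adding a new system means registering a
# factory under a name, with no change to the calling code.
_REGISTRY: Dict[str, Callable[[], ComputeBackend]] = {}


def register_backend(name: str, factory: Callable[[], ComputeBackend]) -> None:
    _REGISTRY[name] = factory


def get_backend(name: str) -> ComputeBackend:
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None


register_backend("local", LocalBackend)
register_backend("threads", ThreadPoolBackend)


def square(x: int) -> int:
    return x * x


if __name__ == "__main__":
    # The same analysis call is issued against each registered backend.
    for name in ("local", "threads"):
        print(name, get_backend(name).run(square, [1, 2, 3]))
```

In this sketch the caller selects a backend by name and never touches backend-specific code, which is the property the Phase I modularity demonstration is meant to show for real systems.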
Our Phase I work will also demonstrate ease of deployment, including provisioning of arbitrary systems and easy installation on cloud services such as OpenStack and Amazon Web Services (AWS). This will all be performed using Kitware's proven practices for agile, durable, and sustainable software.

Commercial applications and other benefits

Because GoBig is open-source and extensible, the community that will grow around the aforementioned tools will foster agility and innovation while reducing maintenance cost over time. The development model used for open-source projects has also been proven to scale to thousands of developers while maintaining a high standard for quality. We will encourage the participation of developers who can add abstractions for more data storage and processing systems. GoBig's flexibility and ease of use will ultimately impact a broad range of data analysts who require a low barrier to entry to distributed compute services, including those in government, academia, and the business community.

Key words

Analytics, Big Data, Software, Open Source

Summary for members of Congress

As the needs for big data storage and processing have escalated dramatically in recent years, a powerful but unwieldy set of disparate tools has appeared that is difficult to use. Our proposed platform, GoBig, addresses this by exposing multiple big data storage and computation platforms through a convenient, unified interface.