SBIR Phase I: Data Squashing for Massive Data Analysis

Period of Performance: 01/01/2003 - 12/31/2003


Phase 1 SBIR

Recipient Firm

Ornarose, Inc.
30 Sunset Drive
Chatham, NJ 07928
Principal Investigator


This Small Business Innovation Research Phase I project focuses on massive datasets containing millions or even billions of data points. Statistical analysis of data on this scale presents new computational challenges. Squashing algorithms compress a massive dataset into a much smaller one so that statistical analyses carried out on the smaller (squashed) dataset reproduce the outputs of the same analyses carried out on the original dataset. Squashing is an alternative to sampling as a way of dealing with massive data and aims to significantly outperform sampling in predictive and inferential accuracy. Squashing affords several advantages: (1) computationally intensive statistical procedures such as non-linear modeling or large-scale variable selection, infeasible when applied directly to the massive dataset, become feasible when applied to the squashed representation; (2) because the squashed dataset may be several orders of magnitude smaller than the original massive dataset, electronic data dissemination becomes much simpler; (3) because squashed datasets are synthetic (they contain no actual data points), they pose no disclosure risk. The objective of this research is to critically evaluate squashing and, contingent on a satisfactory evaluation, to develop a commercial data squashing software product.
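
To make the idea of a squashed dataset concrete, the following is a minimal illustrative sketch and not the project's algorithm, which the abstract does not specify. It assumes the simplest possible scheme: rows are grouped into quantile bins of the first column, and each group is replaced by one synthetic point (the group mean) carrying a weight equal to the group size. The function names `squash` and `weighted_slope`, the binning rule, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def squash(X, n_bins=10):
    """Compress rows of X into weighted pseudo-points (illustrative only).

    Rows are grouped by quantile bin of the first column. Each group is
    replaced by its mean, a synthetic point, with weight = group size.
    """
    X = np.asarray(X, dtype=float)
    # Interior quantile cut points of the first column.
    edges = np.quantile(X[:, 0], np.linspace(0, 1, n_bins + 1)[1:-1])
    labels = np.searchsorted(edges, X[:, 0])
    points, weights = [], []
    for b in np.unique(labels):
        rows = X[labels == b]
        points.append(rows.mean(axis=0))
        weights.append(len(rows))
    return np.array(points), np.array(weights, dtype=float)

def weighted_slope(x, y, w):
    """Slope of the weighted least-squares line of y on x."""
    xm = np.average(x, weights=w)
    ym = np.average(y, weights=w)
    return np.sum(w * (x - xm) * (y - ym)) / np.sum(w * (x - xm) ** 2)

# Simulated "massive" data (hypothetical): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)
data = np.column_stack([x, y])

# Squash 100,000 rows into at most 50 weighted synthetic points.
points, weights = squash(data, n_bins=50)

# Run the same analysis on the original and on the squashed data.
full_slope = weighted_slope(x, y, np.ones_like(x))
squashed_slope = weighted_slope(points[:, 0], points[:, 1], weights)
print(len(points), full_slope, squashed_slope)
```

In this sketch the weighted mean of the squashed points matches the mean of the original data, and the fitted slope on the squashed data is close to the slope on the full data. Matching only group means is a deliberately crude choice; a practical squashing method would be expected to preserve more of the data's structure so that a wider range of analyses carries over.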