Applying Resampling Techniques to Large Data Sets

Period of Performance: 08/01/2001 - 07/31/2003


Phase 2 SBIR

Recipient Firm

Public Data Queries, Inc.
Chelsea, MI 48118
Principal Investigator


This project will investigate the feasibility and merit of applying bootstrapping and similar resampling strategies to the analysis of relatively large census and survey microdata files. Although bootstrapping has generally been applied most fruitfully to small-sample research designs, new technology now allows resampling to be applied effectively to much larger data sets than have previously been analyzed with these techniques. In particular, we will focus on two aims: (1) determining confidence intervals for frequency counts, percentages, and summary statistics for basic multivariate analyses from large census and survey data files; and (2) assessing the potential for resampling techniques to assist in masking sensitive information extracted from data sets in which confidentiality of the respondents (disclosure avoidance) is an important concern and where minimal perturbation of the data is desired. A computational tool that uses an existing parallel high-performance computing environment and is optimized for resampling will be created to facilitate the implementation and testing of resampling techniques such as bootstrapping on data sets of 10,000-50,000 records.

PROPOSED COMMERCIAL APPLICATION: Incorporating resampling into our own information system, PDQ-Explore, will increase its value to data users in the fields of social science, health care, community services, and commercial information. Licensing the software to other information providers who need confidence intervals will broaden our customer base. Protecting confidentiality will allow us to tap more data sources and make more data sets available to more users in research, education, government, and commerce.
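
As an illustration of the first aim (confidence intervals for percentages), the following is a minimal sketch of a percentile bootstrap on a file of 10,000 records. It is not the project's tool or method: the function name bootstrap_percent_ci, its parameters, and the synthetic 0/1 data are all hypothetical, and the percentile interval is only one of several bootstrap interval constructions.

```python
import random


def bootstrap_percent_ci(flags, n_resamples=200, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a percentage.

    flags is a list of 0/1 indicators, one per record (hypothetical layout).
    Returns (lower, upper) bounds in percentage points.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(flags)
    estimates = []
    for _ in range(n_resamples):
        # Draw n records with replacement and recompute the percentage.
        resample = rng.choices(flags, k=n)
        estimates.append(100.0 * sum(resample) / n)
    estimates.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the estimates.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return estimates[lo_idx], estimates[hi_idx]


# Synthetic stand-in for a microdata file: 10,000 records, 30% with the trait.
flags = [1] * 3000 + [0] * 7000
point_estimate = 100.0 * sum(flags) / len(flags)
lower, upper = bootstrap_percent_ci(flags)
print(f"estimate {point_estimate:.1f}%, 95% interval ({lower:.2f}%, {upper:.2f}%)")
```

Each resample is independent of the others, which is why this kind of computation is a natural fit for a parallel computing environment: the loop over resamples can be split across processors and the estimates merged before the quantiles are taken.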