Bootstrapping Background Knowledge to Arbitrate Data Integrity Issues Within Large Volumes of Data

Period of Performance: 04/28/2011 - 10/28/2011

$100K

Phase I SBIR

Recipient Firm

Stottler Henke Associates
1650 South Amphlett Boulevard, Suite 300
San Mateo, CA 94402

Principal Investigator

Abstract

As intelligence and sensor data acquisition technologies improve and expand, maintaining data integrity across vast amounts of data continues to plague researchers. Although this is generally considered a problem of computational scalability, we recognize that a much greater challenge lies in developing and maintaining the background knowledge needed to move beyond traditional data integrity checks and to identify and resolve more complex inconsistencies. With our proposed system, called Arbiter, we seek to exploit the hidden opportunity that very large data sources present in three ways: (1) constructing a pseudo-genome for each entity instance to rapidly identify likely matches, and leveraging lightweight ontology alignment heuristics to efficiently identify high-confidence alignment opportunities; (2) leveraging data redundancy to autonomously learn the background knowledge necessary to detect complex relational inconsistencies; and (3) validating entity instance matches with a wide range of heuristics, in combination with the acquired background knowledge, to resolve higher levels of uncertainty. Phase I prototyping will draw on existing software components, allowing rapid progress.
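
The abstract does not say how a pseudo-genome is built, so the following Python sketch is purely illustrative and is not Arbiter's method. It assumes one plausible reading of step (1): each entity instance is reduced to a compact fingerprint (here, the set of character trigrams of its attribute values), and two instances are flagged as a likely match when their fingerprints overlap strongly. The function names, the trigram choice, the Jaccard measure, and the 0.5 threshold are all assumptions introduced for this example.

```python
from itertools import combinations


def pseudo_genome(record, n=3):
    """Hypothetical pseudo-genome: the set of character n-grams taken from
    the normalized attribute values of one entity instance."""
    # Sort by attribute name so the fingerprint does not depend on key order.
    text = " ".join(str(v).lower().strip() for _, v in sorted(record.items()))
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def likely_matches(records, threshold=0.5):
    """Return index pairs whose pseudo-genomes overlap at or above threshold."""
    genomes = [pseudo_genome(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(genomes[i], genomes[j]) >= threshold
    ]


# Made-up records for illustration only.
records = [
    {"name": "John A. Smith", "city": "San Mateo"},
    {"name": "John Smith", "city": "San Mateo"},
    {"name": "Maria Lopez", "city": "Boston"},
]
matches = likely_matches(records)
print(matches)
```

With these sample records, only the first two instances are paired. The all-pairs comparison shown here is quadratic in the number of records; at the data volumes the abstract describes, some form of indexing or blocking over the fingerprints would presumably be needed, and the candidate pairs would then go on to the heuristic validation described in step (3).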