Extracting Semantic Knowledge from Clinical Reports

Period of Performance: 01/01/2009 - 12/31/2010

$423K

Phase 2 SBIR

Recipient Firm

Logical Semantics, Inc.
Indianapolis, IN 46202
Principal Investigator

Abstract

DESCRIPTION (provided by applicant): Analyzing and processing free-text medical reports for data mining and clinical data interchange is one of the most challenging problems in medical informatics, yet it is crucial for continued research advances and improvements in clinical care. Natural language processing (NLP) is an important enabling technology, but has been held back because it is difficult to understand human language, since it requires extensive domain knowledge. In Phase I, we developed new statistical and machine learning methods that apply domain specific knowledge to the semantic analysis of free-text radiology reports. The methods enabled the creation of two new prototype applications - a SNOMED CT (Systematized Nomenclature of Medicine--Clinical Terms) coding service called SnomedCoder, and a text mining tool for analyzing a large corpus of medical reports, called DataMiner. In Phase II, we will accomplish the following specific aims: 1) Improve the semantic extraction methods developed in Phase I, 2) Expand the semantic knowledge base and classify at least two million new unique sentences from multiple medical institutions, 3) Provide a SNOMED CT auto coding service (alpha service) to participating Indiana Health Information Exchange hospitals, and 4) Build a commercial version of the DataMiner software, and test its functionality using researchers at the Regenstrief Institute. These scientific innovations will revolutionize the ability of health care researchers to analyze vast repositories of clinical information currently locked up in electronic medical records, and correlate this data with new biomedical discoveries in proteonomics and genomics. The ability to codify text rapidly will extend the potential for clinical decision support beyond its narrow base of numeric and structured medical data, and enable SNOMED CT to become a useful coding standard. Phase III will offer coding and data mining services to healthcare payers (both private and government), pharmaceuticals, and academic researchers. A key advantage of our approach over other NLP systems is that we attempt to codify all the information in the report and not just a limited subset, and insist on expert validation which provides a high degree of confidence in the accuracy of the coded data.Project Narrative