Clustering Software for Biomedical Applications

Period of Performance: 09/30/2007 - 09/29/2008


Phase 2 SBIR

Recipient Firm

Insightful Corporation
Seattle, WA 98109
Principal Investigator


DESCRIPTION (provided by applicant): We propose to provide clustering software for very large databases and for categorical data. Investigators in virtually all areas of research seek to discover patterns and relationships in data. Computer-intensive exploratory analysis, or data mining, is having a huge impact in science and industry (e.g. Berkhin 2002, Maitra 2002). However, the availability of software for obtaining partitions and visualizing them lags far behind the proliferation of proposed methods and the growth in size of available databases. We believe that implementing new algorithms for clustering large datasets that may include non-numeric attributes, and visualizing cluster properties, will open new opportunities for data analysis. In Phase I, we developed scalable implementations of clustering methods, including k-means and its extensions to categorical and mixed-mode data, and demonstrated that combining clustering with visualization reveals structure in data that neither could reveal alone. Our ultimate goal in Phases II and III is to develop a modular addition to the S-PLUS language called S+CLUSTER that provides the following key features:
- A suite of clustering algorithms suitable for large and possibly high-dimensional datasets that may include categorical attributes;
- Extensive capabilities for visual exploration of clustering results; and
- Tools for validation and diagnostics that facilitate objective assessment of clustering results.
We intend to create software that is flexible and easy to use, and that enables the analysis and understanding of data from a wide range of applications. Clustering, or unsupervised classification, has been used in genetics research, protein classification, psychiatric research, analysis of biomedical signals, and segmentation of medical images, among other areas.
The software will be part of an integrated environment for data analysis and will permit customization of the clustering process, extending the ability of biomedical researchers to understand complex data. New insights into microarrays, epidemiological data, and protein databases may have high potential for drug discovery, disease diagnosis, and treatment.