Least Angle Regression

Period of Performance: 12/01/2008 - 09/30/2010


Phase 2 SBIR

Recipient Firm

Seattle, WA 98109
Principal Investigator


DESCRIPTION (provided by applicant): This SBIR project aims to produce superior methods and software for classification and regression when there are many potential predictor variables to choose from. The methods should (1) produce stable results, where small changes in the data do not produce major changes in the variables selected or in model predictions; (2) produce accurate predictions; (3) facilitate scientific interpretation by selecting a smaller subset of predictors that provides the best predictions; (4) allow continuous and categorical variables; and (5) support linear regression, logistic regression (predicting a binary outcome), survival analysis, and other types of regression. This project is based on least angle regression, which unifies and provides a fast implementation for a number of modern regression techniques. Least angle regression has great potential, but currently available software is limited in scope and robustness. The outcome of this project should be software that is more robust and more widely applicable. This software would apply broadly, including to medical diagnosis, cancer detection, feature selection in microarrays, and modeling of patient characteristics such as blood pressure. Phase I work demonstrates feasibility by extending least angle regression in three key directions: categorical predictors, logistic regression, and a numerically accurate implementation. Phase II goals include extensions to other types of explanatory variables (e.g., polynomial or spline functions, and interactions between variables), to survival and additional regression models, and to missing data and massive data sets. The proposed software will enable medical researchers to obtain high prediction accuracy and stable, interpretable results in high-dimensional situations.
Predicting outcomes based on covariates, determining which covariates most affect outcomes, and adjusting treatment effect estimates for covariates are among the most important problems in biostatistics. Prediction and feature selection are particularly difficult when there are more possible features than samples; gene microarrays and protein mass spectrometry are extreme examples of this, producing thousands to millions of measurements per sample. Least angle regression (LARS) excels at feature selection; the proposed software should enable medical researchers to obtain stable and interpretable models with better prediction accuracy in high-dimensional situations.
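
To make the feature-selection setting concrete, the following is a minimal sketch of plain least angle regression for a linear model with more features than samples. It is an illustration only and is not the project's software: the function name `lars_order`, the synthetic data, and all parameter values are assumptions introduced here, and the sketch omits the lasso modification, categorical predictors, logistic regression, and the numerical safeguards the abstract describes.

```python
import numpy as np


def lars_order(X, y, n_steps):
    """Return the order in which plain LARS adds predictors (hypothetical helper)."""
    # Center and scale columns to unit norm; center the response.
    X = X - X.mean(axis=0)
    X = X / np.linalg.norm(X, axis=0)
    r = y - y.mean()                      # current residual
    p = X.shape[1]
    is_active = np.zeros(p, dtype=bool)
    active = []
    for _ in range(n_steps):
        c = X.T @ r                       # correlations with the residual
        # The next predictor is the inactive one most correlated with r.
        j_new = int(np.argmax(np.where(is_active, -np.inf, np.abs(c))))
        active.append(j_new)
        is_active[j_new] = True
        C = abs(c[j_new])
        # Equiangular direction u over the active set.
        s = np.sign(c[active])
        XA = X[:, active] * s
        g = np.linalg.solve(XA.T @ XA, np.ones(len(active)))
        AA = 1.0 / np.sqrt(g.sum())
        u = XA @ (AA * g)
        a = X.T @ u
        # Step length: move until an inactive predictor ties the active ones.
        rest = ~is_active
        with np.errstate(divide="ignore", invalid="ignore"):
            cand = np.concatenate([
                (C - c[rest]) / (AA - a[rest]),
                (C + c[rest]) / (AA + a[rest]),
            ])
        cand = cand[cand > 1e-12]
        gamma = cand.min() if cand.size else C / AA
        r = r - gamma * u
    return active


# Synthetic example (assumed data): 200 samples, 500 features, 3 true predictors.
rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))
y = 5 * X[:, 3] - 5 * X[:, 17] + 5 * X[:, 42] + 0.5 * rng.standard_normal(n)

order = lars_order(X, y, n_steps=3)
print("predictors entered, in order:", order)
```

With a signal this strong relative to the noise, the first three predictors to enter are expected to be the three that generated the response. Real microarray or mass spectrometry data would be far less clean, which is where the stability goals stated above become important.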