A performance-improved implementation of ScaLAPACK implemented in Open Community Runtime

Period of Performance: 01/01/2014 - 12/31/2014

$150K

Phase 1 SBIR

Recipient Firm

ET International, Inc.
100 White Clay Center Drive
Newark, DE 19711
Principal Investigator
Firm POC

Abstract

Software development faces many challenges as high-performance computing (HPC) moves toward exascale: scalability, programmability, performance portability, resilience, and energy efficiency. Tackling these challenges requires fundamental shifts in the execution models and programming models employed by HPC software. In particular, the performance of commonly used linear algebra libraries suffers from bulk synchronization and poor data distribution, and these libraries will not scale to exascale. One execution model that promises to tackle these challenges is the codelet execution model, for which Open Community Runtime (OCR) is an effective implementation that has been given particular focus within the DOE X-Stack program. We have shown that event-driven execution models can improve the performance of key linear algebra functions on clusters by up to 80% over the commercial linear algebra libraries currently used in commercial and government scientific endeavors. In Phase I, we will implement a data distribution and scheduling framework that enables efficient asynchronous intra- and inter-function fine-grained data distribution, computation, and synchronization. We will implement the singular value decomposition function (PDGESVD) in OCR as a proof of concept of the framework. In Phase II, we will implement the remaining functions and extend those implementations to GPUs where applicable. Although Phase II will cover many more functions, it will require less work per function because the Phase I effort will provide a general template and framework to follow. By providing the ScaLAPACK library in OCR, we will give the industry a performance-improved version of a commonly used library. This will allow less-experienced HPC users and developers to gain performance improvements without code changes, and it will greatly increase adoption of OCR in government, academia, and industry. Further, the techniques derived in this process can be used as a template for further performance optimizations of scientific applications.
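
The event-driven idea in the abstract can be illustrated with a minimal, single-threaded sketch. It is not the OCR API and not the proposed framework: the `Codelet` class, the `run` function, and the 2x2 example are hypothetical names chosen for illustration. Each codelet becomes ready as soon as its own inputs are available, so a row sum depends only on its two products and never on a global barrier across all products.

```python
from collections import deque


class Codelet:
    """Hypothetical task that may fire once all of its inputs are available."""

    def __init__(self, name, fn, deps=()):
        self.name = name
        self.fn = fn
        self.deps = list(deps)          # names of codelets whose results we consume
        self.pending = len(self.deps)   # number of inputs not yet produced
        self.successors = []            # codelets waiting on our result


def run(codelets):
    """Execute codelets in dependency order using per-task readiness counters."""
    by_name = {c.name: c for c in codelets}
    for c in codelets:
        for d in c.deps:
            by_name[d].successors.append(c)

    ready = deque(c for c in codelets if c.pending == 0)
    results, order = {}, []
    while ready:
        c = ready.popleft()
        results[c.name] = c.fn(*(results[d] for d in c.deps))
        order.append(c.name)
        # Completing a codelet is the "event" that may release its successors.
        for s in c.successors:
            s.pending -= 1
            if s.pending == 0:
                ready.append(s)

    if len(order) != len(codelets):
        raise ValueError("dependency cycle: some codelets never became ready")
    return results, order


# Toy 2x2 matrix-vector product y = A x, with scalars standing in for tiles.
A = [[1, 2], [3, 4]]
x = [5, 6]

codelets = []
for i in range(2):
    for j in range(2):
        # One product codelet per tile; default arguments bind i and j now.
        codelets.append(Codelet(f"p{i}{j}", lambda i=i, j=j: A[i][j] * x[j]))
    # The row sum depends only on the two products of its own row.
    codelets.append(Codelet(f"y{i}", lambda a, b: a + b, deps=[f"p{i}0", f"p{i}1"]))

results, order = run(codelets)
print(results["y0"], results["y1"])
```

In a real runtime the ready queue would be drained by many workers in parallel and the tiles would be distributed across nodes. The sketch only shows the firing rule, in which synchronization is expressed per dependency rather than per phase.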