Understanding the Propagation of Transient Errors in HPC Applications

SESSION: Resilience


EVENT TAG(S): Power, Performance, System Software, Resiliency

TIME: 2:30PM - 3:00PM

SESSION CHAIR(S): Frank Mueller

AUTHOR(S): Rizwan Ashraf, Roberto Gioiosa, Gokcen Kestor, Ronald DeMara, Chen-Yong Cher, Pradip Bose



Resiliency of exascale systems has quickly become an important concern for the scientific community.
Despite its importance, much remains to be determined about how faults propagate and how quickly they affect HPC applications. Understanding where and how fast faults propagate could lead to more efficient implementations of application-driven error detection and recovery.

In this work, we propose a fault propagation framework to analyze how faults propagate in MPI applications and to understand their vulnerability to faults. The framework combines compiler-level code transformation and instrumentation with a runtime checker. Using the information it provides, we employ machine learning techniques to derive application fault propagation models that can be used to estimate the number of corrupted memory locations at runtime.
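
The idea of tracking corrupted memory locations can be illustrated with a toy example. The sketch below is not the authors' framework: it assumes a single-process integer 3-point stencil in place of an MPI application, injects one bit flip, compares the faulty run against a fault-free ("golden") run after every step, and fits a plain least-squares line as a stand-in for the machine learning model. All function names (step, flip_bit, corrupted_counts, fit_line) are hypothetical.

```python
def step(state):
    """One sweep of a 3-point stencil; the two boundary cells stay fixed."""
    nxt = list(state)
    for i in range(1, len(state) - 1):
        nxt[i] = state[i - 1] + state[i] + state[i + 1]
    return nxt


def flip_bit(value, bit):
    """Model a transient fault as a single bit flip in an integer."""
    return value ^ (1 << bit)


def count_diff(golden, faulty):
    """Number of locations whose value differs from the fault-free run."""
    return sum(g != f for g, f in zip(golden, faulty))


def corrupted_counts(n_cells, n_steps, fault_index, fault_bit):
    """Run golden and faulty copies side by side (assumed toy setup).

    Returns the corrupted-location count right after injection and
    after each of the n_steps sweeps.
    """
    golden = [1] * n_cells
    faulty = list(golden)
    faulty[fault_index] = flip_bit(faulty[fault_index], fault_bit)
    counts = [count_diff(golden, faulty)]
    for _ in range(n_steps):
        golden = step(golden)
        faulty = step(faulty)
        counts.append(count_diff(golden, faulty))
    return counts


def fit_line(ys):
    """Ordinary least-squares fit of ys against step index 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    counts = corrupted_counts(n_cells=32, n_steps=5, fault_index=16, fault_bit=3)
    slope, intercept = fit_line(counts)
    print("corrupted locations per step:", counts)
    print("fitted model: count = %.1f * step + %.1f" % (slope, intercept))
```

In this toy setting the corruption spreads by one cell in each direction per sweep, so a straight line describes it well; real applications, especially ones that exchange data between MPI ranks, would be expected to need richer models.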

Chair/Author Details:

Frank Mueller (Chair) - North Carolina State University

Rizwan Ashraf - University of Central Florida

Roberto Gioiosa - Pacific Northwest National Laboratory

Gokcen Kestor - Pacific Northwest National Laboratory

Ronald DeMara - University of Central Florida

Chen-Yong Cher - IBM Corporation

Pradip Bose - IBM Corporation

