The International Conference for High Performance Computing, Networking, Storage and Analysis (sponsored by ACM and IEEE)

SCHEDULE: NOV 15-20, 2015

Understanding the Propagation of Transient Errors in HPC Applications

SESSION: Resilience

EVENT TYPE: Papers

EVENT TAG(S): Power, Performance, System Software, Resiliency

TIME: 2:30PM - 3:00PM

SESSION CHAIR(S): Frank Mueller

AUTHOR(S): Rizwan Ashraf, Roberto Gioiosa, Gokcen Kestor, Ronald DeMara, Chen-Yong Cher, Pradip Bose

ROOM: 19AB

ABSTRACT:

Resiliency of exascale systems has quickly become an important concern for the scientific community.
Despite its importance, much remains to be determined about how faults spread through HPC applications and how quickly they affect them. Understanding where and how fast faults propagate could lead to more efficient implementations of application-driven error detection and recovery.

In this work, we propose a fault propagation framework to analyze how faults propagate in MPI applications and to understand how vulnerable those applications are to faults. We employ a combination of compiler-level code transformation and instrumentation, along with a runtime checker. Using the information provided by our framework, we employ a machine learning technique to derive application fault propagation models that can be used to estimate the number of corrupted memory locations at runtime.

Chair/Author Details:

Frank Mueller (Chair) - North Carolina State University

Rizwan Ashraf - University of Central Florida

Roberto Gioiosa - Pacific Northwest National Laboratory

Gokcen Kestor - Pacific Northwest National Laboratory

Ronald DeMara - University of Central Florida

Chen-Yong Cher - IBM Corporation

Pradip Bose - IBM Corporation


Paper provided by the ACM Digital Library

Paper also available from IEEE Computer Society