Sponsored by ACM and IEEE
The International Conference for High Performance Computing, Networking, Storage and Analysis

SCHEDULE: NOV 15-20, 2015

Practical Fault Tolerance on Today's HPC Systems

SESSION: Practical Fault Tolerance on Today's HPC Systems

EVENT TYPE: Tutorials

EVENT TAG(S): Resiliency

TIME: 8:30AM - 12:00PM

Presenter(s): Kathryn Mohror, Nathan DeBardeleben, Laxmikant V. Kale, Eric Roman



Failure rates on high performance computing systems rise as component counts grow. Applications running on these systems currently experience failures at intervals on the order of days; on future systems, the predicted interval between failures ranges from minutes to hours. Developers need fault-tolerant techniques to protect their application runs from losing valuable data. These techniques range from changing algorithms, to checkpoint and restart, to programming model-based approaches. In this tutorial, we will present introductory material for developers who wish to learn the fault-tolerant techniques available on today’s systems. We will give background information on the kinds of faults occurring on today’s systems and the trends we expect going forward. Following this, we will give detailed information on several fault-tolerant approaches and how to incorporate them into applications. Our focus will be on scalable checkpoint and restart mechanisms and programming model-based approaches.

Chair/Presenter Details:

Kathryn Mohror - Lawrence Livermore National Laboratory

Nathan DeBardeleben - Los Alamos National Laboratory

Laxmikant V. Kale - University of Illinois at Urbana-Champaign

Eric Roman - Lawrence Berkeley National Laboratory
