SC15 Austin, TX

Mitigation of Failures in High Performance Computing via Runtime Techniques

Student: Xiang Ni (University of Illinois at Urbana-Champaign)
Advisor: Laxmikant Kale (University of Illinois at Urbana-Champaign)
Abstract: The number of components assembled to create a supercomputer keeps increasing in pursuit of the computational power required to enable breakthroughs in science and engineering. However, the reliability and capacity of each individual component have not increased as fast as the total number of components. As a result, machines fail frequently, hampering the smooth execution of high performance applications. This thesis strives to answer the following questions with regard to this challenge: How can a runtime system provide fault tolerance support more efficiently with minimal application intervention? What are the effective ways to detect and correct silent data corruption? Given limited memory resources, how do we enable the execution and checkpointing of data-intensive applications?

