An Analysis of Network Congestion in the Titan Supercomputer's Interconnect
Student: Jonathan Freed (University of South Carolina)
Supervisor: Saurabh Gupta (Oak Ridge National Laboratory)
Abstract: The Titan supercomputer is used by computational scientists to run large-scale simulations. These simulations often run concurrently and therefore share system resources. A saturated system can suffer network congestion, which reduces interconnect throughput. Our project analyzed data collected by testing the throughput between different node pairs. In particular, we looked for factors that correlate with low throughput. We investigated the direct path between the two test nodes, as well as the neighborhood of nodes that are one connection away from the direct path. For each set of nodes, we analyzed the effects of the distance between the nodes, the number of “busy” nodes (nodes currently allocated to applications), and the applications that were running. By understanding application interference, we can develop job-scheduling strategies that reduce such interference, leading to more efficient use of Titan’s resources and faster computations for researchers.
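Illustrative sketch (not the project's code): the path and neighborhood sets described in the abstract could be built roughly as follows. It assumes each node is identified by an (x, y, z) coordinate on a 3D torus and that traffic is routed one dimension at a time along the shorter way around each ring. The torus dimensions, the routing rule, the function names, and the sample "busy" set are all hypothetical choices made for illustration.

```python
# Hypothetical sketch: direct path and one-hop neighborhood on a 3D torus.
# Assumptions (not from the abstract): nodes are (x, y, z) tuples, routing
# resolves x first, then y, then z, taking the shorter way around each ring.

def torus_step(a, b, size):
    """Return +1 or -1: the direction from a toward b that is shorter on a ring."""
    forward = (b - a) % size
    return 1 if forward <= size - forward else -1

def direct_path(src, dst, dims):
    """List the nodes visited from src to dst, one dimension at a time."""
    path = [tuple(src)]
    cur = list(src)
    for axis, size in enumerate(dims):
        while cur[axis] != dst[axis]:
            cur[axis] = (cur[axis] + torus_step(cur[axis], dst[axis], size)) % size
            path.append(tuple(cur))
    return path

def neighborhood(path, dims):
    """Return the nodes exactly one link away from the path, excluding the path."""
    on_path = set(path)
    near = set()
    for node in path:
        for axis, size in enumerate(dims):
            for delta in (-1, 1):
                nb = list(node)
                nb[axis] = (nb[axis] + delta) % size
                nb = tuple(nb)
                if nb not in on_path:
                    near.add(nb)
    return near

def count_busy(nodes, busy):
    """Count how many of the given nodes are currently allocated to applications."""
    return sum(1 for n in nodes if n in busy)

# Example with made-up values: a 4x4x4 torus and an invented busy set.
dims = (4, 4, 4)
path = direct_path((0, 0, 0), (3, 1, 0), dims)
near = neighborhood(path, dims)
busy = {(3, 0, 0), (0, 1, 0), (2, 2, 2)}
hops = len(path) - 1                      # distance between the two test nodes
busy_on_path = count_busy(path, busy)     # busy nodes on the direct path
busy_nearby = count_busy(near, busy)      # busy nodes one connection away
```

The per-pair values computed here (hop distance, busy nodes on the path, busy nodes in the neighborhood) correspond to the factors the abstract lists and could then be compared against measured throughput.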
Two-page extended abstract: pdf