Increasing Fabric Utilization with Job-Aware Routing
Authors: Jens Domke (Dresden University of Technology)
Abstract: InfiniBand (IB) has become one of the most widely used interconnects for HPC systems in recent years. The communication throughput that parallel applications can achieve depends heavily on how many of the fabric's links and switches are available to them. That number is determined by the quality of the routing algorithm, which usually optimizes the forwarding tables for global path balancing. However, in a multi-user/multi-job HPC environment this results in suboptimal usage of the shared network by individual jobs. We extend an existing routing algorithm to factor in the locality of running parallel applications, and we create an interface between the batch system and the IB subnet manager to trigger the necessary re-routing of the fabric. As a result, our job-aware routing allows each running parallel application to make better use of the shared IB fabric, thereby increasing application performance and overall fabric utilization.
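To illustrate the difference between global and job-aware path balancing, here is a toy sketch in Python. It is not the routing algorithm from the abstract, which is not specified here; it is a simple greedy balancer written for illustration only. The function name `balance_paths`, the `candidates` and `job_of` structures, and the link names `up0` and `up1` are all hypothetical. The sketch also balances per source/destination pair, whereas real IB forwarding tables are destination-based.

```python
from collections import defaultdict


def balance_paths(candidates, job_of, job_aware=True):
    """Greedily pick one path per (src, dst) pair (toy model, hypothetical API).

    candidates: dict mapping (src, dst) -> list of paths, each a tuple of link ids
    job_of:     dict mapping node -> job id; nodes not listed are treated as idle
    """
    load = defaultdict(int)  # balancing weight accumulated per link
    chosen = {}

    def weight(pair):
        if not job_aware:
            return 1  # global balancing: every pair counts equally
        src, dst = pair
        job = job_of.get(src)
        # Job-aware: only pairs inside the same running job add load.
        return 1 if job is not None and job == job_of.get(dst) else 0

    # Route the pairs that carry weight first, then the rest (stable by name).
    for pair in sorted(candidates, key=lambda p: (-weight(p), p)):
        # Pick the path whose most-loaded link is least loaded; ties by name.
        best = min(candidates[pair],
                   key=lambda path: (max(load[link] for link in path), path))
        chosen[pair] = best
        for link in best:
            load[link] += weight(pair)
    return chosen


# Four node pairs share two uplinks; only a, c, e, g belong to a running job.
two_uplinks = [("up0",), ("up1",)]
candidates = {pair: two_uplinks
              for pair in [("a", "e"), ("b", "f"), ("c", "g"), ("d", "h")]}
job_of = {"a": 1, "c": 1, "e": 1, "g": 1}

global_routes = balance_paths(candidates, job_of, job_aware=False)
aware_routes = balance_paths(candidates, job_of, job_aware=True)
```

In this example the global balancer spreads the four pairs evenly over both uplinks, yet both pairs of the running job end up on `up0`. The job-aware variant places those two pairs on different uplinks, so the job can use both.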
Two-page extended abstract: pdf