AccFFT: A New Parallel FFT Library for CPU and GPU Architectures
Student: Amir Gholami (The University of Texas at Austin)
Supervisor: George Biros (The University of Texas at Austin)
Abstract: We present a new distributed-FFT library. Despite the extensive prior work on FFTs, we achieve significant speedups over existing libraries. Communication is the main bottleneck of distributed FFTs, and our library uses novel all-to-all communication algorithms to overcome this barrier. These schemes are modified for GPUs to effectively hide PCI-e overhead. Even though we do not use GPUDirect technology, the GPU times are either better than or almost the same as the CPU times (corresponding to 16 or 20 CPU cores). We present performance results on the Maverick and Stampede platforms at the Texas Advanced Computing Center (TACC) and on the Titan system at the Oak Ridge National Laboratory (ORNL). Comparisons with the P3DFFT and PFFT libraries show a consistent $2$--$3\times$ speedup across a range of processor counts and problem sizes. A comparison with the FFTE library (GPU only) shows a similar trend, with a $2\times$ speedup. The library has been tested on up to 131K cores and 4,096 GPUs of Titan, and on up to 16K cores of Stampede.
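To illustrate why the all-to-all exchange is central to a distributed FFT, the following is a minimal sketch of the textbook transpose-based (slab-decomposed) 3D FFT, simulated in a single process with NumPy. It is not AccFFT's algorithm or API: the function name `slab_fft3d`, the parameter `nranks`, and the list-of-arrays stand-in for MPI ranks are all assumptions made for illustration.

```python
import numpy as np

def slab_fft3d(x, nranks):
    """Simulate a slab-decomposed 3D FFT over `nranks` hypothetical ranks.

    Illustrative only: the "ranks" are entries of a Python list, and the
    all-to-all exchange is done with array splits and concatenations.
    """
    n0, n1, n2 = x.shape
    assert n0 % nranks == 0 and n1 % nranks == 0, "axes 0 and 1 must divide evenly"

    # Each simulated rank owns a contiguous slab of planes along axis 0.
    slabs = np.split(x.astype(complex), nranks, axis=0)

    # Step 1: local 2D FFTs over the two axes that are fully local.
    slabs = [np.fft.fftn(s, axes=(1, 2)) for s in slabs]

    # Step 2: all-to-all exchange. Rank r cuts its slab into nranks blocks
    # along axis 1 and sends block j to rank j.
    send = [np.split(s, nranks, axis=1) for s in slabs]  # send[r][j]
    recv = [
        np.concatenate([send[r][j] for r in range(nranks)], axis=0)
        for j in range(nranks)
    ]  # recv[j] has shape (n0, n1 // nranks, n2): axis 0 is now local

    # Step 3: local 1D FFTs along the axis that became local.
    pencils = [np.fft.fft(p, axis=0) for p in recv]

    # Gather the distributed result into one array, only to allow checking.
    return np.concatenate(pencils, axis=1)
```

In this sketch every rank exchanges data with every other rank in step 2, which is the communication pattern that the abstract's all-to-all algorithms target. The result can be checked against `np.fft.fftn` on the full array.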
Two-page extended abstract: pdf