Beating cuBLAS: Automatically Generating Bespoke Matrix Multiplication Kernels Using GiMMiK
Authors: Freddie D. Witherden (Imperial College London), Bartosz D. Wozniak (Imperial College London), Francis P. Russell (Imperial College London), Peter E. Vincent (Imperial College London), Paul H. J. Kelly (Imperial College London)
Abstract: Matrix multiplication is a fundamental performance primitive, ubiquitous in all areas of science and engineering. In this work we present GiMMiK: a generator of bespoke matrix multiplication kernels for block-by-panel type multiplications where the block matrix is constant. GiMMiK exploits a priori knowledge of this matrix to generate highly performant CUDA code for NVIDIA GPUs. The performance advantage of GiMMiK kernels is particularly apparent when the matrix has some degree of sparsity. GiMMiK embeds matrix entries directly in the code and eliminates multiplications by zero. Together with the ability of GiMMiK kernels to avoid poorly optimised cleanup code, this allows GiMMiK to outperform cuBLAS on a variety of real-world problems. Speedups of 10 times are found on a K40c for a 294 × 1029 matrix with 99% sparsity. GiMMiK is open source and released under a three-clause BSD license.
Two-page extended abstract: pdf
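The two ideas named in the abstract, embedding the constant matrix entries in the generated code and dropping multiplications by zero, can be illustrated with a small generator. The following is a minimal sketch in Python and is not GiMMiK's actual implementation or API: the function name `generate_kernel`, the kernel name `bespoke_mm`, the one-thread-per-column mapping, and the row-major layout of `b` and `c` are all assumptions made for illustration.

```python
def generate_kernel(name, a):
    """Return CUDA-style source text computing C = A * B for a constant A.

    `a` is a list of rows holding the known entries of A. B (k x n) and
    C (m x n) are assumed to be row-major, and each thread is assumed to
    handle one column of B. These layout choices are illustrative only.
    """
    lines = [
        f"__global__ void {name}(const double* b, double* c, int n)",
        "{",
        "    int col = blockIdx.x * blockDim.x + threadIdx.x;",
        "    if (col >= n) return;",
    ]
    for i, row in enumerate(a):
        # Embed each non-zero entry as a literal; zero entries emit no code.
        terms = [
            f"{v!r} * b[{j} * n + col]"
            for j, v in enumerate(row)
            if v != 0.0
        ]
        rhs = " + ".join(terms) if terms else "0.0"
        lines.append(f"    c[{i} * n + col] = {rhs};")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # A small sparse matrix: only three of its nine entries are non-zero.
    a = [
        [0.5, 0.0, -1.0],
        [0.0, 0.0, 0.0],
        [0.0, 2.0, 0.0],
    ]
    print(generate_kernel("bespoke_mm", a))
```

For the 3 × 3 example matrix, the generated kernel body contains three multiplications rather than nine, and the fully unrolled form has no loop and therefore no remainder or cleanup path. A production generator would also need to consider register pressure, memory access patterns, and single versus double precision, none of which this sketch addresses.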