Advanced Tiling Techniques for Memory-Starved Streaming Numerical Kernels
Authors: Tareq Malas (King Abdullah University of Science & Technology), Georg Hager (Erlangen Regional Computing Center), Hatem Ltaief (King Abdullah University of Science & Technology), David Keyes (King Abdullah University of Science & Technology)
Abstract: Many temporal blocking techniques for stencil algorithms have been suggested for speeding up memory-bound code via improved temporal locality. Most of the established work concentrates on updating separate cache blocks per thread, which works on all types of shared-memory systems, regardless of whether there is a shared cache. The downside of this approach is that the cache space available to each thread can become too small to accommodate enough updates to decouple the code from memory bandwidth. In this poster we introduce a generalized multi-dimensional intra-tile parallelization scheme for shared-cache multicore processors that significantly reduces cache size requirements. It ensures data access patterns that allow efficient hardware prefetching and TLB utilization. We describe the approach and some implementation details, and we show that our solution is consistently faster than the state-of-the-art stencil frameworks PLUTO and Pochoir.
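For readers unfamiliar with temporal blocking, the following is a minimal sketch of the general idea only, not the intra-tile parallelization scheme introduced in the poster. It assumes a 1D three-point Jacobi stencil with fixed boundary values and uses overlapped tiles with redundant halo computation, one of the simplest forms of temporal blocking. All function names (`jacobi_step`, `naive`, `overlapped_tiles`) and the parameters `steps` and `tile` are hypothetical and chosen for illustration.

```python
def jacobi_step(u):
    """One sweep of a 3-point averaging stencil; the two end cells stay fixed."""
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v


def naive(u, steps):
    """Reference version: every time step streams over the whole array."""
    for _ in range(steps):
        u = jacobi_step(u)
    return u


def overlapped_tiles(u, steps, tile):
    """Temporal blocking sketch: advance each tile by `steps` updates at once.

    Each tile is loaded together with a halo of `steps` cells per side, so the
    stale values at the halo edges cannot reach the tile interior within
    `steps` updates. The halo work is redundant; this is the price paid for
    reusing the block while it is (ideally) resident in cache.
    """
    n = len(u)
    out = u[:]
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(start - steps, 0)   # left halo, clipped at the domain edge
        hi = min(end + steps, n)     # right halo, clipped at the domain edge
        block = u[lo:hi]
        for _ in range(steps):
            block = jacobi_step(block)
        # Keep only the cells that belong to this tile.
        out[start:end] = block[start - lo:end - lo]
    return out


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(40)]
    same = naive(data, 4) == overlapped_tiles(data, 4, 8)
    print("tiled result matches naive result:", same)
```

The sketch also hints at the trade-off the abstract describes: the working set of one tile is `tile + 2 * steps` cells, so when every thread owns a separate tile, the cache share per thread limits how many updates can be fused before the data spills back to memory.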
Two-page extended abstract: pdf