BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:2.0
BEGIN:VEVENT
DTSTART:20151117T231500Z
DTEND:20151118T010000Z
LOCATION:Level 4 - Lobby
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: Many temporal blocking techniques for stencil algorithms have been proposed to speed up memory-bound code through improved temporal locality. Most of the established work concentrates on updating separate cache blocks per thread, which works on all types of shared-memory systems, regardless of whether there is a shared cache. The downside of this approach is that the cache space available to each thread can become too small to accommodate enough updates to decouple the code from memory bandwidth. In this poster we introduce a generalized multi-dimensional intra-tile parallelization scheme for shared-cache multicore processors that significantly reduces cache size requirements. It ensures data access patterns that allow efficient hardware prefetching and TLB utilization. We describe the approach and some implementation details, and we show that our solution is consistently faster than the state-of-the-art stencil frameworks PLUTO and Pochoir.
SUMMARY:Advanced Tiling Techniques for Memory-Starved Streaming Numerical Kernels
PRIORITY:3
END:VEVENT
END:VCALENDAR