SCHEDULE: NOV 15-20, 2015

Accelerating Big Data Processing with Hadoop, Spark, and Memcached on Modern Clusters

SESSION: Accelerating Big Data Processing with Hadoop, Spark, and Memcached on Modern Clusters

EVENT TYPE: Tutorials

EVENT TAG(S): Clouds and Distributed Computing

TIME: 1:30PM - 5:00PM

Presenter(s): Dhabaleswar K. (DK) Panda, Xiaoyi Lu, Hari Subramoni



Apache Hadoop and Spark are gaining prominence in Big Data processing and analytics. Similarly, Memcached is becoming important for large-scale query processing in Web 2.0 environments. Recent studies have shown that default Hadoop, Spark, and Memcached cannot efficiently leverage the features of modern high-performance computing clusters, such as Remote Direct Memory Access (RDMA)-enabled high-performance interconnects and high-throughput, large-capacity parallel storage systems (e.g., Lustre). These middleware are traditionally written with sockets and do not deliver the best performance on modern high-performance networks. In this tutorial, we will provide an in-depth overview of the architecture of Hadoop components (HDFS, MapReduce, RPC, HBase, etc.), Spark, and Memcached. We will examine the challenges in re-designing the networking and I/O components of these middleware for modern RDMA-capable interconnects and protocols (such as InfiniBand, iWARP, RoCE, and RSocket) and for modern storage architectures. Using the publicly available software packages from the High-Performance Big Data (HiBD, http://hibd.cse.ohio-state.edu) project, we will present case studies of the new designs for several Hadoop, Spark, and Memcached components and their associated benefits. Through these case studies, we will also examine the interplay between high-performance interconnects, storage systems (HDD and SSD), and multi-core platforms to achieve the best solutions for these components and for Big Data applications on modern HPC clusters.

Chair/Presenter Details:

Dhabaleswar K. (DK) Panda - Ohio State University

Xiaoyi Lu - Ohio State University

Hari Subramoni - Ohio State University

