Rapid Replication of Multi-Petabyte File Systems
Student: Justin G. Sybrandt (National Energy Research Scientific Computing Center)
Supervisor: Jason Hick (Lawrence Berkeley National Laboratory)
Abstract: As file systems grow larger, tools that were once industry standard become unsustainable at scale. Today, large data sets containing hundreds of millions of files often take longer to traverse than to copy. The time needed to replicate a file system has grown from hours to weeks, an unrealistic wait for a backup. Distsync is our new utility for quickly updating an out-of-date file system replica. By using General Parallel File System (GPFS) policy scans, distsync finds changed files without walking the directory tree. It then parallelizes the work across multiple nodes to make the most of GPFS performance. NERSC currently uses distsync to replicate file systems with more than 100 million inodes and more than 4 petabytes of data.
Two-page extended abstract: pdf
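
The abstract says that distsync spreads the changed files found by a policy scan across multiple nodes, but it does not say how the work is divided. As an illustration only, the sketch below shows one common way to balance such a workload: a greedy largest-first split by file size. The names `ChangedFile` and `partition_by_size` are hypothetical, and the input is assumed to be a list of (path, size) records already extracted from a scan. This is not distsync's actual algorithm or file format.

```python
import heapq
from typing import List, NamedTuple


class ChangedFile(NamedTuple):
    """Hypothetical record for one changed file reported by a scan."""
    path: str
    size: int  # size in bytes


def partition_by_size(files: List[ChangedFile],
                      num_workers: int) -> List[List[ChangedFile]]:
    """Split files into num_workers batches with roughly equal total bytes.

    Greedy heuristic (an assumption, not distsync's documented method):
    take files from largest to smallest and hand each one to the worker
    that currently has the fewest bytes assigned.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")

    # Min-heap of (bytes_assigned, worker_index); the least-loaded worker
    # is always at the top.
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    batches: List[List[ChangedFile]] = [[] for _ in range(num_workers)]

    for f in sorted(files, key=lambda item: item.size, reverse=True):
        load, worker = heapq.heappop(heap)
        batches[worker].append(f)
        heapq.heappush(heap, (load + f.size, worker))

    return batches
```

Each resulting batch could then be handed to a separate node for copying. Balancing by bytes rather than by file count matters when a few very large files sit among millions of small ones, although a real tool would likely also account for per-file overhead.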