Optimizing Shuffle in Wide-Area Data Analytics
Shuhao Liu, Hao Wang and Baochun Li
University of Toronto, University of Toronto, University of Toronto

As increasingly large volumes of raw data are generated at geographically distributed datacenters, they need to be efficiently processed by data analytic jobs spanning multiple datacenters across wide-area networks. Designed for a single datacenter, existing data processing frameworks, such as Apache Spark, are not able to deliver satisfactory performance when these wide-area analytic jobs are executed. As wide-area networks interconnecting datacenters may not be congestion free, there is a compelling need for a new system framework that is optimized for wide-area data analytics. In this paper, we design and implement a new proactive data aggregation framework based on Apache Spark, with a focus on optimizing the network traffic incurred in shuffle stages of data analytic jobs. The objective of this framework is to strategically and proactively aggregate the output data of mapper tasks to a subset of worker datacenters, as a replacement to Sparks original passive fetch mechanism across datacenters. It improves the performance of wide-area analytic jobs by avoiding repetitive data transfers, which improves the utilization of inter-datacenter links. Our extensive experimental results using standard benchmarks across six Amazon EC2 regions have shown that our proposed framework is able to reduce job completion times by up to 73%, as compared to the existing baseline implementation in Spark.