Dela - Sharing Large Datasets between Hadoop Clusters
Alexandru A. Ormenisan and Jim Dowling
KTH Royal Institute of Technology, KTH Royal Institute of Technology

Big data has, in recent years, revolutionised an ever-growing number of fields, from machine learning to climate science to genomics. The current state of the art for storing large datasets is object stores and distributed filesystems, with Hadoop being the dominant open-source platform for managing ‘Big Data’. Existing large-scale storage platforms, however, lack support for the efficient sharing of large datasets over the Internet. Systems that are widely used for disseminating large files, such as BitTorrent, would need to be adapted to handle two challenges: network links with both high latency and high bandwidth, and scalable storage backends that are optimised for streaming rather than random access. In this paper, we introduce Dela, a peer-to-peer data-sharing service integrated into the Hops Hadoop platform that provides an end-to-end solution for dataset sharing. Dela is designed for large-scale storage backends and for data transfers that are both non-intrusive to existing TCP network traffic and higher in throughput than TCP on high-latency, high-bandwidth network links, such as transatlantic links. Dela provides a pluggable storage layer that implements two alternative ways for clients to access shared data: stream processing of data as it arrives, using Kafka, and traditional offline access to data, using the Hadoop Distributed Filesystem. Dela is the first step for the Hadoop platform towards an open dataset ecosystem that supports user-friendly publishing, searching, and downloading of large datasets.
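
As an illustration of what a pluggable storage layer over streaming-oriented backends might look like, the following minimal Python sketch defines one append-only interface with two interchangeable implementations. All class and method names are hypothetical and are not taken from Dela; the two in-memory classes merely stand in for a file in the Hadoop Distributed Filesystem and for a Kafka topic. The in-order buffering is an assumption about how pieces that arrive out of order from peers could be reconciled with storage that favours sequential writes, not a description of Dela's implementation.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical append-only interface: data is handed over strictly in order."""

    @abstractmethod
    def append(self, data: bytes) -> None:
        ...


class InMemoryFileBackend(StorageBackend):
    """Stand-in for an append-only file (offline access, e.g. a file in HDFS)."""

    def __init__(self) -> None:
        self.content = bytearray()

    def append(self, data: bytes) -> None:
        self.content.extend(data)


class InMemoryStreamBackend(StorageBackend):
    """Stand-in for a message stream (access on arrival, e.g. a Kafka topic)."""

    def __init__(self) -> None:
        self.records: list[bytes] = []

    def append(self, data: bytes) -> None:
        self.records.append(bytes(data))


class InOrderWriter:
    """Buffers pieces that arrive out of order and flushes them sequentially."""

    def __init__(self, backend: StorageBackend) -> None:
        self.backend = backend
        self.next_index = 0                  # index of the next piece to flush
        self.pending: dict[int, bytes] = {}  # pieces received ahead of their turn

    def receive(self, index: int, data: bytes) -> None:
        # Ignore pieces that were already flushed or are already buffered.
        if index < self.next_index or index in self.pending:
            return
        self.pending[index] = data
        # Flush the longest contiguous run that starts at next_index.
        while self.next_index in self.pending:
            self.backend.append(self.pending.pop(self.next_index))
            self.next_index += 1
```

Under these assumptions, the same `InOrderWriter` can be constructed with either backend, so the transfer logic does not need to know whether the data ends up in a file for later batch processing or in a stream that consumers read as it arrives.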