Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata
Mahmoud Ismail, Ermias Gebremeskel, Theofilos Kakantousis, Gautier Berthou and Jim Dowling
KTH - Royal Institute of Technology, RISE SICS, RISE SICS, RISE SICS, KTH - Royal Institute of Technology

Hadoop is a popular system for storing, managing, and processing large volumes of data, but it has only bare-bones internal support for metadata: metadata is a scalability bottleneck, so keeping less of it allows the system to scale further. The result is a scalable platform with rudimentary access control that is neither user-friendly nor developer-friendly. Moreover, metadata services built on Hadoop, such as SQL-on-Hadoop, access control, data provenance, and data governance, are necessarily implemented as eventually consistent services, which increases development effort and makes the software more brittle. In this paper, we present a new project-based multi-tenancy model for Hadoop, built on a new distribution of Hadoop that provides a distributed database backend for the metadata layer of the Hadoop Distributed File System (HDFS). We extend Hadoop’s metadata model to introduce projects, datasets, and project-users as new core concepts that enable a user-friendly, UI-driven Hadoop experience. Because our metadata service is backed by a transactional database, developers can easily extend the metadata by adding new tables, and can ensure the strong consistency of the extended metadata using both transactions and foreign keys.
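
To make the idea of extending metadata with new tables concrete, the following is a minimal sketch, not the system's actual schema or API. It assumes an in-memory SQLite database as a stand-in for the distributed transactional database, and the table names `inodes` and `dataset_descriptions`, as well as the helper `add_dataset`, are hypothetical. It shows a foreign key tying extended metadata to a file system entry, and a transaction keeping the two in step.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the distributed transactional database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical, simplified stand-in for the file system's inode table.
conn.execute("CREATE TABLE inodes (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Hypothetical extension table: each row must reference an existing inode,
# and is removed automatically when that inode is deleted.
conn.execute("""
    CREATE TABLE dataset_descriptions (
        inode_id INTEGER PRIMARY KEY REFERENCES inodes(id) ON DELETE CASCADE,
        description TEXT NOT NULL
    )
""")


def add_dataset(conn, inode_id, name, description):
    """Insert an inode and its extended metadata in one transaction."""
    with conn:  # commits if the block succeeds, rolls back on an exception
        conn.execute("INSERT INTO inodes (id, name) VALUES (?, ?)", (inode_id, name))
        conn.execute(
            "INSERT INTO dataset_descriptions (inode_id, description) VALUES (?, ?)",
            (inode_id, description),
        )


def count(conn, table):
    """Return the number of rows in one of the two tables above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


add_dataset(conn, 1, "sensor_logs", "Raw sensor readings")

# Extended metadata that points at a missing inode is rejected by the foreign key.
try:
    with conn:
        conn.execute(
            "INSERT INTO dataset_descriptions (inode_id, description) VALUES (?, ?)",
            (99, "orphan"),
        )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the inode also removes its extended metadata, so no stale rows remain.
with conn:
    conn.execute("DELETE FROM inodes WHERE id = ?", (1,))
remaining = count(conn, "dataset_descriptions")
```

In this sketch the extended metadata can never outlive or precede the entry it describes, which is the property an eventually consistent side service would have to approximate with extra reconciliation logic.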