JeCache: Just-Enough Data Caching with Just-in-Time Prefetching for Big Data Applications
Yifeng Luo, Jia Shi and Shuigeng Zhou
Fudan University, Fudan University, Fudan University

Big data clusters introduce an intermediate cache layer between computing frameworks and the underlying distributed file systems, so that upper-level applications and end users can access big datasets from cache and share them efficiently across computing frameworks. Because cache resources are shared concurrently by multiple applications and end users, existing on-demand caching strategies, which cache a prohibitively big dataset as a whole, result in intense contention. Moreover, big data applications usually involve massive numbers of file scan operations, so a cached-in data block has little chance of being accessed before it is cached out by the many other blocks cached on demand; it is therefore unwise to cache a block long before it is actually accessed. In this paper, we propose a novel just-enough big data caching scheme with a just-in-time block prefetching mechanism to improve cache efficiency in big data clusters. With just-in-time block prefetching, a block is cached in right before a task begins to process it, rather than being cached in together with the other blocks of the dataset being accessed. We monitor block accesses to measure the average processing time of data blocks, and then compute the minimal number of blocks that should be kept in cache for a big dataset, so that the speed of data processing matches that of data prefetching and each upper-level task can obtain its input block from cache just in time. Our experimental results show that just-in-time prefetching can limit the cache resources required by big data applications while providing the same level of performance speedup as when all data blocks are cached.
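
To make the just-enough idea concrete, the following is a minimal sketch of how a cache budget could be derived from monitored timings. It is not the formula used by JeCache, which the abstract does not state. It assumes a simple model in which each task scans blocks sequentially, one prefetch per task is in flight at a time, and a block must be resident for the whole time it is being processed. The function name `min_cached_blocks` and its parameters are hypothetical.

```python
import math


def min_cached_blocks(avg_process_s: float,
                      avg_prefetch_s: float,
                      parallel_tasks: int = 1) -> int:
    """Estimate how many blocks of a dataset must be resident in cache.

    Assumed model (not taken from the paper): while a task processes one
    block, the blocks it will need during the next `avg_prefetch_s` seconds
    must already be in cache or on their way into it.
    """
    if avg_process_s <= 0:
        raise ValueError("avg_process_s must be positive")
    if avg_prefetch_s < 0:
        raise ValueError("avg_prefetch_s must be non-negative")
    if parallel_tasks < 1:
        raise ValueError("parallel_tasks must be at least 1")

    # Blocks a task consumes during one prefetch, rounded up so that the
    # next block is always ready before the current one is finished.
    lookahead = math.ceil(avg_prefetch_s / avg_process_s)

    # One block being processed plus the look-ahead window, per task.
    return parallel_tasks * (1 + lookahead)


if __name__ == "__main__":
    # A task that needs 2 s per block, with 3 s to prefetch a block.
    print(min_cached_blocks(avg_process_s=2.0, avg_prefetch_s=3.0))
    # Eight concurrent tasks whose processing is slower than prefetching.
    print(min_cached_blocks(avg_process_s=4.0, avg_prefetch_s=1.0,
                            parallel_tasks=8))
```

Under this model the budget depends on the ratio of prefetch time to processing time and on the number of concurrent tasks, not on the size of the dataset, which is the intuition behind caching just enough blocks rather than the whole dataset.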