Retrospective Lightweight Distributed Snapshots Using Loosely Synchronized Clocks
Aleksey Charapko, Ailidani Ailijiang, Murat Demirbas and Sandeep Kulkarni
SUNY Buffalo, SUNY Buffalo, SUNY Buffalo, Michigan State University

Taking a consistent snapshot of a distributed system requires collating and aligning local logs from each node to construct a pairwise concurrent cut. By leveraging NTP-synchronized clocks and augmenting them with logical-clock causality information, Retroscope provides a lightweight solution for taking unplanned retrospective snapshots of past distributed system states. Rather than storing a multiversion copy of the entire system data, Retroscope achieves this efficiently by maintaining a configurable-size sliding window-log at each node that captures recent operations. In addition to instant and retrospective snapshots, Retroscope provides incremental and rolling snapshots, which use an existing full snapshot to reduce the cost of constructing a new snapshot in its proximity. These capabilities are useful for stepwise debugging and root-cause analysis, and for supporting data-integrity monitoring and checkpoint-recovery. We provide implementations of Retroscope for the Voldemort distributed datastore and the Hazelcast in-memory data grid, and evaluate their performance under varying workloads.
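To make the window-log idea concrete, the following is a minimal single-node sketch in Python, not Retroscope's actual implementation. It assumes timestamps are comparable (physical, logical) pairs appended in nondecreasing order, that each log entry stores the value a write overwrote, and that None stands for an absent key. The names WindowLog, Node, put, and snapshot_at are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Any, Deque, Dict, NamedTuple, Optional, Tuple

# Hypothetical timestamp layout: (physical clock reading, logical counter).
Timestamp = Tuple[int, int]


class LogEntry(NamedTuple):
    ts: Timestamp
    key: str
    old: Optional[Any]  # value before the write; None means the key was absent


class WindowLog:
    """Bounded log of recent writes; the oldest entries fall out of the window."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.entries: Deque[LogEntry] = deque(maxlen=capacity)
        self.dropped_up_to: Optional[Timestamp] = None  # ts of newest evicted entry

    def append(self, entry: LogEntry) -> None:
        if len(self.entries) == self.entries.maxlen:
            # The deque is about to evict its oldest entry; remember its timestamp.
            self.dropped_up_to = self.entries[0].ts
        self.entries.append(entry)


class Node:
    """A toy key-value node that logs every write into its window-log."""

    def __init__(self, capacity: int) -> None:
        self.data: Dict[str, Any] = {}
        self.log = WindowLog(capacity)

    def put(self, ts: Timestamp, key: str, value: Any) -> None:
        self.log.append(LogEntry(ts, key, self.data.get(key)))
        self.data[key] = value

    def snapshot_at(self, t: Timestamp) -> Dict[str, Any]:
        """Rebuild the local state as of time t by undoing newer writes."""
        dropped = self.log.dropped_up_to
        if dropped is not None and t < dropped:
            raise ValueError("requested time is older than the window-log")
        state = dict(self.data)
        for entry in reversed(self.log.entries):  # newest first
            if entry.ts <= t:
                break  # everything from here back was already applied at time t
            if entry.old is None:
                state.pop(entry.key, None)
            else:
                state[entry.key] = entry.old
        return state


node = Node(capacity=4)
node.put((1, 0), "a", 1)
node.put((2, 0), "b", 2)
node.put((3, 0), "a", 3)
print(node.snapshot_at((2, 0)))  # {'a': 1, 'b': 2}
```

The sketch only reconstructs one node's past state. Whether requesting the same timestamp from every node yields a consistent global cut depends on the clock guarantees, which are not modeled here.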