High performance recovery for parallel state machine replication
Odorico Mendizabal, Fernando Luis Dotti and Fernando Pedone
FURG, PUCRS, University of Lugano

State machine replication is a fundamental approach to high availability. Despite the vast literature on the topic, relatively few studies have considered the issues involved in recovering faulty replicas. Recovering a replica requires (a) retrieving and installing an up-to-date replica checkpoint, and (b) restoring and re-executing the log of commands not reflected in the checkpoint. Parallel techniques to state machine replication render recovery particularly challenging since throughput under normal execution (i.e., in the absence of failures) is very high. Consequently, the log of commands that need to be applied until the replica is available is typically large, which delays recovery. In this paper, we present two techniques to optimize recovery in parallel state machine replication. The first technique allows new commands to execute concurrently with the execution of logged commands, before replicas are completely updated. The second technique introduces on-demand state recovery, which allows segments of a checkpoint to be recovered concurrently.