Transparent Fault-Tolerance using Intra-Machine Full-Software-Stack Replication
Giuliano Losa, Antonio Barbalace, Yuzhong Wen, Marina Sadini, Ho-Ren Chuang and Binoy Ravindran
Virginia Tech, Virginia Tech, Virginia Tech, Virginia Tech, Virginia Tech, Virginia Tech

As the number of processors and the amount of memory in computing systems keep growing, the likelihood of CPU core failures and memory errors increases and can threaten system availability. Software components can be hardened against such failures by running several replicas of a component on hardware replicas that fail independently and are coordinated by a State-Machine Replication protocol. A common solution is to provide redundancy by using several identical physical machines as replicas of the original one. However, a single CPU core failure does not always bring down the entire system, and in such scenarios full-machine replication may be overkill. In this paper, we introduce multithreaded full-software-stack replication within a single commodity machine. Our approach runs replicas on isolated hardware partitions, each with its own CPU cores, memory, and full software stack. A hardware failure in one partition can be recovered from by having another partition take over. We have implemented FT-Linux, a Linux-based operating system that transparently replicates race-free POSIX applications on different hardware partitions of a single machine. We evaluated FT-Linux on several popular Linux applications and observed a worst-case slowdown due to replication of 20%.
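
The replication idea summarized above can be illustrated with a minimal sketch. It is not FT-Linux code: the names `Replica` and `ReplicatedService`, the two toy operations, and the in-process "failure" flag are all assumptions made for illustration. It only shows the general state-machine replication principle that deterministic replicas applying the same ordered inputs reach the same state, so a surviving replica can take over after a failure.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical deterministic state machine standing in for one partition."""
    name: str
    state: int = 0
    applied: int = 0   # number of operations applied so far
    alive: bool = True

    def apply(self, op):
        """Apply one (kind, value) operation deterministically."""
        kind, value = op
        if kind == "add":
            self.state += value
        elif kind == "mul":
            self.state *= value
        else:
            raise ValueError(f"unknown operation: {kind}")
        self.applied += 1


@dataclass
class ReplicatedService:
    """Feeds every operation, in the same order, to all live replicas."""
    primary: Replica
    secondary: Replica
    log: list = field(default_factory=list)

    def submit(self, op):
        # Record the operation, then apply it to each replica that is alive.
        self.log.append(op)
        for replica in (self.primary, self.secondary):
            if replica.alive:
                replica.apply(op)
        return self.primary.state

    def fail_primary(self):
        # Simulate a hardware fault on the primary: the secondary takes over.
        self.primary.alive = False
        self.primary, self.secondary = self.secondary, self.primary


if __name__ == "__main__":
    svc = ReplicatedService(Replica("partition-0"), Replica("partition-1"))
    svc.submit(("add", 5))
    svc.submit(("mul", 3))
    svc.fail_primary()
    # The former secondary already holds the same state and keeps serving.
    print(svc.primary.name, svc.submit(("add", 1)))
```

In this sketch the ordering of operations is trivially fixed by a single caller. A real system must also make multithreaded execution deterministic across replicas, which is the harder part that the toy model leaves out.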