Distributed Fault Tolerant Linear System Solvers based on Erasure Coding
Xuejiao Kang, David F. Gleich, Ahmed Sameh and Ananth Grama
Purdue University--West Lafayette, Purdue University--West Lafayette, Purdue University--West Lafayette, Purdue University--West Lafayette

As parallel and distributed systems scale, fault tolerance becomes an increasingly important problem – particularly on systems with limited I/O capacity and bandwidth. In this paper, we use the recently proposed concept of erasure coded linear systems to develop efficient coding schemes and distributed implementations. Erasure coded computations belong to the class of algorithmic fault tolerance schemes. They are based on augmenting an input dataset, executing the algorithm on the augmented dataset, and in the event of a fault, recovering the solution from the corresponding augmented solution. This process can be viewed as the computational analog of erasure coded storage schemes. The proposed technique has a number of important benefits: (i) as the hardware platform scales in size and number of faults, our scheme yields increasing improvement in resource utilization, compared to traditional schemes; (ii) the proposed scheme is easy to code – the core algorithms remain the same; and (iii) the general scheme is flexible – accommodating a range of computation and communication tradeoffs. We demonstrate our approach in the context of solving sparse linear systems in a distributed environment. We present new coding schemes for augmenting the input matrix that satisfy the recovery equations of erasure coding with high probability in the event of random failures. These coding schemes also minimize fill (non-zero elements introduced by the coding block), while being amenable to efficient partitioning across processing nodes. We demonstrate experimentally that our scheme adds minimal overhead for fault tolerance (significantly better than replicated execution and deterministic replay, at scale), yields excellent parallel efficiency and scalability, and is robust to different fault arrival models.