Fault-scalable Virtualized Infrastructure Management
Mukil Kesavan, Ada Gavrilovska and Karsten Schwan
VMware Inc., Georgia Institute of Technology, Georgia Institute of Technology

Large-scale virtualized datacenters require considerable automation in infrastructure management in order to operate efficiently. Automation is impaired, however, because deployments are prone to multiple types of subtle faults caused by hardware failures, software bugs, misconfiguration, crashes, performance-degraded hardware, etc. Existing Infrastructure-as-a-Service (IaaS) management stacks incorporate few, if any, resilience measures to shield end users from such cloud provider-level failures and poor performance. This paper proposes and evaluates extensions to IaaS stacks that mask faults in a fault-agnostic manner while ensuring that overheads can be kept proportional to observed failure rates. We also demonstrate that infrastructure automation services and end-user applications can use service-specific knowledge, together with our new interface, to achieve better outcomes.