Learning from failure across multiple clusters: A trace-driven approach to understanding, predicting, and mitigating job terminations
Nosayba El-Sayed, Hongyu Zhu and Bianca Schroeder
MIT, University of Toronto, University of Toronto

In large-scale computing platforms, jobs are prone to interruptions and premature terminations, limiting their usability and leading to significant waste in cluster resources. In this paper, we tackle this problem in three steps. First, we provide a comprehensive study based on log data from multiple largescale production systems to identify patterns in the behaviour of unsuccessful jobs across different clusters and investigate possible root causes behind job termination. Our results reveal several interesting properties that distinguish unsuccessful jobs from others, particularly w.r.t. resource consumption patterns and job configuration settings. Secondly, we design a machine learningbased framework for predicting job and task terminations. We show that job failures can be predicted relatively early with high precision and recall, and also identify attributes that have strong predictive power of job failure. Finally, we demonstrate in a concrete use case how our prediction framework can be used to mitigate the effect of unsuccessful execution using an effective task-cloning policy that we propose.