Workflow Optimization in PAW
Maxim Filatov and Verena Kantere
UNIGE, University of Geneva

Many industrial applications, in domains such as telecommunications, the web, and sales, require complex analytics to be performed across several data processing systems. Such analytics are usually expressed as workflows, and carrying them out is both labor-intensive and time-consuming. At the same time, as the amount of data to be analysed grows, the optimization of analytics workflows becomes crucial for satisfying business objectives. This paper focuses on workflow optimization with respect to time efficiency over multiple execution engines, such as a traditional DBMS, a MapReduce engine, and a scripting engine. This configuration is emerging as a common paradigm for combining the analysis of unstructured and structured data. We propose a novel optimization technique as part of our system, PAW (Platform for Analytics Workflows). The technique creates alternative workflow structures and their execution plans based on equivalent combinations and orders of operators. It employs an exhaustive and a heuristic algorithm to efficiently search the space of equivalent workflow structures and select the one with the optimal execution plan. We present a thorough experimental study that showcases the efficiency of the proposed optimization technique in a fully fledged multi-engine system, applied to three real-world applications and their data, as well as to a synthetic benchmark.