School of Computer Science Professor Works with Microsoft Research to Make Data Transformation Easier
The growing field of self-service data transformation took a big step forward with the Transform-Data-by-Example (TDE) service. TDE works as a search engine for data transformation libraries, alleviating the difficulty of data wrangling.
The project was at an early stage of conception at Microsoft Research when School of Computer Science Assistant Professor Xu Chu joined and helped contribute to its success.
“Data transformation is a big part of data cleaning, which is very time consuming and expensive,” Chu said.
When data comes from different sources or is manually entered, it’s often inconsistent and challenging to work with until it’s prepared. Data preparation involves cleaning, standardizing, and transforming raw data sets so they can be analyzed effectively. Data scientists can spend up to 80 percent of their time just transforming data.
Developers have created custom code libraries for tasks, such as name parsing and address standardization, that data scientists might need for transformation. Yet these libraries are only useful if the data scientist can find them. Finding them hasn’t always been easy – until now.
TDE indexes thousands of functions from GitHub and Stack Overflow, so users only need to provide their desired output for a few input examples to find the transformation program they need. Currently, TDE has a 72 percent accuracy rate for synthesizing correct transformation programs.
The front end of TDE is a Microsoft Excel plug-in that users can download from Office. Once the user provides a few input/output examples, TDE connects with the back end on Microsoft Azure’s cloud, searches thousands of functions, and synthesizes programs from the relevant functions that reproduce the user’s examples. The synthesis step draws on techniques from the field of program synthesis.
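The search-by-example idea can be illustrated with a toy sketch in Python. This is not TDE's code or API: the three candidate functions, the `CANDIDATES` table, and `find_transformations` are hypothetical stand-ins for an index of real library functions, and the sketch only tests single functions rather than synthesizing multi-step programs as TDE does.

```python
from datetime import datetime


def last_name_first(name):
    """Hypothetical candidate: 'Ada Lovelace' -> 'Lovelace, Ada'."""
    first, _, last = name.strip().rpartition(" ")
    return f"{last}, {first}" if first else last


def to_upper(text):
    """Hypothetical candidate: upper-case the whole string."""
    return text.upper()


def iso_date(text):
    """Hypothetical candidate: '08/27/2018' -> '2018-08-27'."""
    return datetime.strptime(text, "%m/%d/%Y").strftime("%Y-%m-%d")


# Stand-in for an index of functions; a real index would be far larger.
CANDIDATES = {
    "last_name_first": last_name_first,
    "to_upper": to_upper,
    "iso_date": iso_date,
}


def find_transformations(examples):
    """Return names of candidates that reproduce every (input, output) pair."""
    matches = []
    for name, func in CANDIDATES.items():
        try:
            if all(func(inp) == out for inp, out in examples):
                matches.append(name)
        except Exception:
            # A candidate that fails on the input simply does not match.
            continue
    return matches


if __name__ == "__main__":
    examples = [("Ada Lovelace", "Lovelace, Ada"), ("Alan Turing", "Turing, Alan")]
    print(find_transformations(examples))
```

Given two name examples, the sketch keeps only the candidate whose output matches both. The user never has to know what the function is called, only what the result should look like.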
“This is a great example of how technologies from non-database domains can help with hard data management problems such as data cleaning,” Chu said.
He believes this type of research has a lot of potential. For example, Chu is working on a project that applies matrix and tensor factorization techniques from statistics and machine learning to data cleaning.
The work on TDE was presented at the Very Large Data Bases (VLDB) conference in Rio de Janeiro in late August. Chu coauthored the paper “Transform Data by Example (TDE): An Extensible Search Engine for Data Transformations” with Microsoft Research’s Surajit Chaudhuri, Kris Ganjam, Yeye He, and Vivek Narasayya, and Twitter’s Yudian Zheng. Earlier, a demo of the work was presented at SIGMOD 2018 in Houston.