College of Computing News

New Training Data Labeling System for Machine Learning Helps Developers

Machine learning (ML) has become one of the most prominent forms of data analysis for everything from fraud detection to visual quality control. Yet the analytic results can often suffer from insufficiently labeled training data.

A team of Georgia Tech researchers has created a system that helps users label a training dataset more accurately than current methods.

“We are looking at the problem from a data management perspective,” said School of Computer Science (SCS) Assistant Professor Xu Chu. “In contrast to a lot of ML research that tries to tackle the lack of sufficient training data from an ML algorithm design perspective, we aim at building a system that helps users effectively label a dataset.”

The system, called GOGGLES, labels datasets using affinity coding, a paradigm that lets ML engineers use various affinity functions, each of which takes two unlabeled examples as input and outputs a real-valued score.

“You can think of affinity as similarity,” said Chu. “The core premise of the work is that two examples share the same label if they are similar according to some affinity functions (or similarity functions).”
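To make the idea concrete, here is a minimal sketch of what an affinity function could look like. It is not taken from GOGGLES: the name `cosine_affinity` and the choice of cosine similarity over raw feature vectors are assumptions for illustration only. The one property it shares with the description above is its shape: two examples in, one real-valued score out.

```python
import numpy as np


def cosine_affinity(a, b):
    """Hypothetical affinity function (an illustration, not the GOGGLES API).

    Takes two examples, given as arrays of numbers, and returns a single
    real-valued score. Higher means more similar.
    """
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # An all-zero example has no direction; treat it as unrelated.
        return 0.0
    return float(np.dot(a, b) / denom)
```

Under this sketch, two examples pointing in the same direction score close to 1, and orthogonal examples score 0.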

The benefits of affinity coding

GOGGLES uses a set of affinity functions that capture various affinities found in images. Next, given a new unlabeled dataset and these affinity functions, GOGGLES constructs an affinity matrix, from which it can assign classes to unlabeled images. Unlike previous approaches, this doesn’t require any metadata or developer intervention.
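The two steps described here, building an affinity matrix and then grouping examples from it, can be sketched roughly as follows. This is an illustration under stated assumptions, not the GOGGLES implementation: the function names are hypothetical, the affinity used is a simple negative Euclidean distance, and the grouping step is a small k-means over the rows of the matrix, swapped in for whatever inference GOGGLES actually performs. The sketch returns anonymous group ids only; mapping groups to class names is left out.

```python
import numpy as np


def negative_distance(a, b):
    """Hypothetical affinity: closer examples get a higher (less negative) score."""
    return -float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))


def affinity_matrix(examples, affinity_fn):
    """Score every pair of examples; entry (i, j) is the affinity of i and j."""
    n = len(examples)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            A[i, j] = A[j, i] = affinity_fn(examples[i], examples[j])
    return A


def assign_groups(A, k, iters=20):
    """Group examples by running a tiny k-means on the rows of A.

    Assumption: a stand-in for illustration, not the method GOGGLES uses.
    """
    # Deterministic farthest-point initialization of the k centers.
    centers = [A[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(A - c, axis=1) for c in centers], axis=0)
        centers.append(A[int(np.argmax(d))])
    centers = np.array(centers)

    for _ in range(iters):
        # Assign each row to its nearest center, then recompute the centers.
        dists = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = A[labels == c].mean(axis=0)
    return labels


examples = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
A = affinity_matrix(examples, negative_distance)
labels = assign_groups(A, k=2)
```

In this toy run, the first two points end up in one group and the last two in the other, because examples with similar rows in the affinity matrix are similar to the same neighbors.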

For each new dataset, users can potentially reuse many of the existing affinity functions already in the library, making GOGGLES a domain-agnostic labeling system. Users and developers can always add more affinity functions to increase the labeling power of GOGGLES.

On five common image classification tasks, GOGGLES reaches up to 98 percent accuracy without requiring extensive developer effort. It also outperforms other well-known data programming systems by up to 21 percent.

Chu co-wrote the paper, GOGGLES: Automatic Image Labeling with Affinity Coding, with Ph.D. students Nilaksh Das and Renzhi Wu, master’s alumni Sanya Chaba and Sakshi Gandhi, and School of Computational Science and Engineering Professor Polo Chau. They presented it at the Association for Computing Machinery's Special Interest Group on Management of Data (SIGMOD) and Symposium on Principles of Database Systems (PODS) conference, held virtually from June 14 to 19.