
The datasets Alex Gray works with are truly astronomical, in every sense of the word. Moore's Law is popularly summarized as computing power doubling roughly every 18 months; data collections in all fields are growing even faster. From document collections and business-transaction databases to data collected by unprecedented international science instruments, many of these collections are now in the realm of terabytes and petabytes.
Even conventional analyses of data, such as simple database queries, are too slow and impractical for the largest datasets.
Gray, an assistant professor in Computational Science and Engineering, is researching new data-structure designs and "smart" algorithms that can make advanced machine learning possible on massive datasets.
"Machine learning" refers to advanced data-analysis techniques focused on making predictions, finding patterns and forming intelligent decisions based on complex data. Unfortunately, the most powerful of such techniques are typically impossible to use on datasets with more than tens of thousands of records.
Gray's goal of "making state-of-the-art machine learning possible on modern problems" is carried out on several fronts, including his "Scalable Machine Learning for Astrostatistics" project, which earned him a coveted CAREER Award from the National Science Foundation.
Consider the ongoing search for quasars. Quasars are among the brightest visible celestial objects, and they are of intense interest to cosmologists because they are thought to date from the early universe, which makes them some of the oldest things we can observe.
Photo: Alex Gray, assistant professor in Computational Science and Engineering, won an NSF CAREER Award for his work in computational astrophysics.
Next-generation telescopes will produce about a billion sky objects per month, according to Gray. Using measurements in several frequency bands, he and his students, working with astrophysicist Gordon Richards of Drexel University and others, have analyzed such massive sky survey datasets to predict which of the myriad objects are actually those rare but critical quasars. Together the collaborators have produced the world's largest and most accurate catalog of quasars.
"Understanding them," Gray says, "provides clues as to the origin of the universe and the question of the existence of dark energy, which has the potential to revolutionize fundamental physics, once understood."
Toward a Breakthrough in Early Cancer Detection
A different set of issues is presented by datasets that are large, not because of the number of objects, but because of the high number of measurements (called features or dimensions) associated with each object.
The vast number of degrees of freedom in such datasets makes modeling and prediction difficult, Gray says, but here, too, his lab has produced impressive results with a statistical model for early screening for ovarian cancer. Working with Georgia Tech professors John McDonald (an ovarian cancer expert in the School of Biology) and Facundo Fernandez (a mass spectrometry expert in the School of Chemistry and Biochemistry), he and his lab modeled spectrographic blood-sample data from about 100 women with and without the disease.
While the number of data objects (in this case, women) was not as large as in the quasar example, each one was represented by 20,000 different measurements derived from spectrographic analysis. Using machine learning methods, Gray's team identified subtle differences in the patterns of blood protein masses that could signal potential cancer development.
"This is pretty exciting because it looks like we have the most accurate early detector of ovarian cancer so far," Gray says. "We're now working on pushing it to the next level, to make it a truly viable commercial diagnostic, which would be a first for this kind of cancer."
Central to many of the diverse applications explored in Gray's lab (which consists of about 25 people) is the development of new hierarchical data structures that allow computers to model massive datasets effectively and thus "touch" each piece of data only as much as needed.
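To give a flavor of what "touching" each piece of data only as much as needed can mean, here is a minimal sketch of one textbook hierarchical structure, a k-d tree, used for nearest-neighbor search. The article does not say which structures Gray's lab builds, so this is an illustrative stand-in rather than the lab's method; the function names `build_kdtree` and `nearest` and the `stats` counter are hypothetical.

```python
import math
import random


def build_kdtree(points, depth=0):
    """Recursively split the points on alternating axes (median split)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None, stats=None):
    """Return (distance, point) for the nearest neighbor of `query`."""
    if node is None:
        return best
    if stats is not None:
        stats["visited"] += 1  # count how many points we actually look at
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best, stats)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, best, stats)
    return best


# Toy data: 2,000 random points in the unit square (a made-up stand-in for sky objects).
random.seed(0)
pts = [(random.random(), random.random()) for _ in range(2000)]
tree = build_kdtree(pts)
stats = {"visited": 0}
dist, point = nearest(tree, (0.5, 0.5), stats=stats)
print(point, "found after visiting", stats["visited"], "of", len(pts), "points")
```

The search returns the same answer as checking every point, but whole branches of the tree are skipped whenever they cannot contain a closer match, so only a small part of the data is examined.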
"Much of how we achieve fast algorithms in computer science is through some kind of tree structure," explains Gray, whose lab both invents and mathematically analyzes such structures. He is also exploring ways to make dimensionally large datasets more manageable by re-representing them with fewer dimensions.
Although not every scientific endeavor lends itself to large datasets, "a lot of important problem areas already are swamped with [them]," Gray says. "I believe the trend is that virtually every area will have a large data problem—or opportunity."
This upward trend is manifest in a single data-modeling project that Gray and his team are already gearing up for: the Large Hadron Collider. The world's largest and most powerful particle accelerator is expected to generate 1 million data objects per second continuously for 15 years.
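For a rough sense of that scale, the figures quoted in this article can be multiplied out. The sketch below assumes truly continuous operation, as stated above, and a year of 365.25 days; the constant names are illustrative only.

```python
OBJECTS_PER_SECOND = 1_000_000        # collider rate quoted in the article
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: 365.25-day year
YEARS = 15
TELESCOPE_OBJECTS_PER_MONTH = 1_000_000_000  # telescope rate quoted in the article

total_objects = OBJECTS_PER_SECOND * SECONDS_PER_YEAR * YEARS
collider_per_month = OBJECTS_PER_SECOND * SECONDS_PER_YEAR / 12
ratio = collider_per_month / TELESCOPE_OBJECTS_PER_MONTH

print(f"{total_objects:.2e} objects over {YEARS} years")
print(f"about {ratio:,.0f} times the telescopes' monthly rate")
```

Under these assumptions the total comes to roughly 4.7 x 10^14 data objects, and the monthly rate is on the order of a few thousand times the billion objects per month expected from next-generation telescopes.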

