DNA is often called the “language of life.” As an example of a mathematical string, DNA is simply—and almost impenetrably—a massive string of four characters: A, C, G and T. Alberto Apostolico talks about DNA a lot.
“It’s the closest thing we have to a message from outer space,” says Apostolico, professor in the schools of Computational Science & Engineering and Interactive Computing. “We do not know where it comes from, understand very little of what it means and have no clue about where it is going.”
Apostolico works in string algorithms and combinatorics, two foundational areas of computer science. At its core, his research can be described as an extremely sophisticated form of pattern matching. Traditionally there’s been one central application area that’s served by such research, and that is the search domain. However, as computational capabilities have increased, so have the possible applications.
“The original problems in pattern matching were string searches, and the original charter of the field itself was to do searches,” Apostolico says. “But now there’s a shift to pattern discovery, meaning the combination of extraction of interesting patterns while at the same time assessing the probability that such patterns could happen by chance.”
Genomics, he says, is the ideal arena for pattern discovery on a massive basis, due to the universe of genomic data now being produced (“There are more beautiful problems in biology than we can supply solutions,” he says). But it’s far from the only one.
In information security, the emergence of events or behaviors that depart signiﬁcantly from routine is cause for alert and scrutiny. Unusually dense sequences of highly similar messages may indicate fraudulent activity or a prelude to malicious attack. In computer architecture and software engineering, identifying frequent or similarcode segments can help to optimize hardware design or compiler parallelization.
The same techniques can help spot plagiarism and other forms of cheating—not just in computing and not just by students. Surprisingly high correlations in answer sheets exposed Chicago schoolteachers who were fixing tests to inflate their classrooms’ averages, as detailed in the 2006 book Freaknomics. Pattern-discovery algorithms like those Apostolico designs helped spot those anomalies.
There are even artistic applications: The application of different measures of text similarity to a corpus of Japanese poetry has revealed crucial connections (or honkadori, a specific kind of poetic allusion) previously unnoticed in literary circles.
“From my point of view, string algorithms have contributed some of the most visible, impactful applications in the world—namely, web text searching and sequencing the human genome, things that have changed everyday life,” Apostolico says.
His own everyday life has been occupied by strings since the late 1970s, when he was among the pioneers in the field. Indeed, in 1985 he co-edited the seminal Combinatorial Algorithms on Words, considered the manifesto of the field and still a foundational text. Apostolico’s co-editor? None other than Zvi Galil, now John P. Imlay Jr. Dean of Computing at Georgia Tech. A year earlier the pair had co-organized a NATO workshop and decided to collect in book form the sparse state of the art in string algorithms.
At that time, Apostolico estimates, there were only about 30 people working in the field. Ten years later, when they edited Pattern Matching Algorithms, the field had developed into a full-fledged CS specialty—one now undergoing “enormous changes.”
“How do you compare things that are essentially too big to compare, meaning that the old ways of computing are no longer feasible, meaningful, or both?” Apostolico says. “It’s one thing to compare and classify 30 proteins that are a thousand characters long; it’s another to compare a million species by their entire genomes, and then come up with a classification system for those species.”
Apostolico once captured this conundrum in a keynote paper, “Of Maps Bigger Than the Empire,” that examined cases of data mining and pattern discovery in which “synopses, indices and relationships seem to grow faster and bigger than the phenomena they were meant to encapsulate.” He took the paper’s title from ”On Exactitude in Science,” Argentinian author Juan Luis Borges’ tale of a fictitious kingdom whose cartographers created maps of increasingly large scale, until eventually they were every bit the size of the geographies they were created to represent (thus becoming useless).