With dozens of research papers about COVID-19 being published each week, it can be difficult for doctors and scientists to keep up with the most important studies.
A student at Georgia Tech, however, is using artificial intelligence (AI) techniques like natural language processing and machine learning (ML) to narrow down the most relevant information in this growing data set.
Kenneth Miller, a student in Georgia Tech’s Online Master of Science in Computer Science (OMSCS) program, is using these tools to develop algorithms that help the most important COVID-19 research reach doctors. His work is part of an ongoing challenge to use ML to empower the medical community to find the best COVID-19 studies.
The challenge started when Kaggle, a Google data science and ML community, partnered with the White House and several leading research groups to create the COVID-19 Open Research Dataset (CORD-19). With more than 47,000 scholarly articles about COVID-19 and other coronaviruses, it’s one of the most comprehensive research databases for the pandemic.
To sift through the data, Kaggle released CORD-19 to its community and asked members to use it to answer some of the toughest research questions about COVID-19. As an incentive, participants like Miller receive $1,000 in prize money for every task completed successfully.
As an OMSCS student specializing in ML, Miller has joined a few previous Kaggle challenges, but for much lower-stakes tasks like predicting home values or NCAA brackets. For Miller, working on this dataset presented an especially relevant problem.
“I am fascinated with everything AI, so when I heard about this, I figured if any of my skills could help anyone, I should try,” said Miller, who is a lawyer outside of his studies.
Keep it Simple
Miller said his OMSCS studies prepared him for the challenge. The AI track focuses on the practical implementation of AI methods, which made it easier for Miller to start with an overwhelming amount of data and get to an endpoint that solves the problem. His experience using the programming language Python for class also enabled him to work nimbly with the data.
Armed with this knowledge, Miller applied a strategy he uses on every project.
“Whenever I start a new project, I try and see if I can craft a simple yet effective solution from scratch,” he said.
He has focused on specific Kaggle challenges where he can apply this strategy. The first ML model Miller developed finds the most relevant sentences in a study. To accomplish this, he used a simple scoring algorithm that counts how many times keywords appear in a sentence. The model then measures the ratio of keyword occurrences to sentence length.
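The scoring idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Miller's actual code: the function names `score_sentence` and `top_sentences`, the word-based measure of sentence length, and the naive sentence splitting are all hypothetical choices made for the example.

```python
import re


def score_sentence(sentence, keywords):
    """Return keyword occurrences divided by sentence length in words.

    Assumption: "sentence length" is measured in words, and matching
    is case-insensitive on whole words.
    """
    words = re.findall(r"[a-z0-9\-]+", sentence.lower())
    if not words:
        return 0.0
    keyword_set = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in keyword_set)
    return hits / len(words)


def top_sentences(text, keywords, n=3):
    """Return the n highest-scoring sentences in the text."""
    # Naive split after sentence-ending punctuation; real papers
    # would need a more careful sentence tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    ranked = sorted(sentences, key=lambda s: score_sentence(s, keywords), reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    sample = (
        "The incubation period is five days. "
        "The weather was nice. "
        "Incubation varies."
    )
    for sentence in top_sentences(sample, ["incubation"], n=2):
        print(sentence)
```

Dividing by sentence length keeps long sentences from winning simply because they contain more words, which is presumably why the ratio is used rather than the raw keyword count.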
For a separate challenge, Miller created a search engine for common COVID-19 research questions, such as: What is the average time the disease takes to incubate? How long is it contagious? How long until symptoms appear?
Up to the Challenge
These are just a few of Miller’s models, and he continues to work on new challenges Kaggle offers. Tasks now include deep dives into epidemiology, such as determining how many patients a study was based on and what scientific method was employed.
“The trick, as in any project like this, is understanding and assimilating the data to start with,” Miller said. “But using Python makes the initial data wrangling pretty easy. The hardest part is building new ways to squeeze more desired info out of the documents.”
Miller’s efforts have been noticed. His work has been cited several times on the contributions page.