Georgia Institute of Technology assistant professors Dhruv Batra and Devi Parikh, both in the College of Computing’s School of Interactive Computing, received Amazon Academic Research Awards (AARA) for a pair of projects they are leading in computer vision and machine learning.
They received $100,000 each from Amazon ($80,000 in gift money and $20,000 in Amazon Web Services credit) in December for projects that aim to produce the next generation of artificial intelligence (AI) agents.
One, Visual Dialog, requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. The other is called Counting Everyday Objects in Everyday Scenes and aims to enable an AI to do just that.
The former project is led by Batra. His main goal is to build a visual chatbot, or an AI agent with the ability to have a natural conversation about visual content. In other words, you could ask it a question about a specific uploaded image – how many people are in the image, for example – and the AI would reply with an accurate answer (“There are five people”).
“Then, I could ask a follow-up question like, ‘What are they doing?’ and the bot would answer, ‘One is jumping,’” Batra explained.
The approach relies on a subfield of machine learning called deep learning. To train deep neural networks for this task, Batra's plan is to collect a large data set in which two humans interact with each other about an image: one asks a question and the other answers. These conversations are recorded and used to train machine learning models to produce the desired responses.
“We can show a machine that when a question is asked about this image, this is how you should respond,” Batra said.
This requires a large data set: as many as 200,000 conversations about a common set of images, each conversation comprising 10 rounds of questions and answers, for roughly two million question-and-answer pairs.
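As a rough illustration (the field names and structure here are hypothetical, not the project's actual data format), one conversation in such a data set might look like this, and the scale works out as described:

```python
# Hypothetical sketch of a single visual-dialog training record; field names
# are illustrative only, not the actual dataset schema.
record = {
    "image_id": "img_00001",
    "dialog": [
        {"question": "How many people are in the image?",
         "answer": "There are five people"},
        {"question": "What are they doing?",
         "answer": "One is jumping"},
        # ... 8 more rounds, for 10 question-answer pairs per conversation
    ],
}

# The scale described above: 200,000 conversations of 10 rounds each.
conversations = 200_000
rounds_per_conversation = 10
qa_pairs = conversations * rounds_per_conversation
print(qa_pairs)  # 2000000, i.e. roughly two million question-answer pairs
```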
The latter project, led by Parikh, focuses on counting objects in images.
“If I give the machine any image, can we get machines to count the number of objects of each category in that scene?” she said. “Are there chairs? How many chairs? Are there tables? How many tables? How many people? And so on.”
This project also leverages deep learning. In particular, Parikh’s lab will build artificial neural network architectures that try to count objects in a variety of ways.
One particularly interesting approach will try to estimate object counts in one shot, just by glancing at the image as a whole. This is inspired by subitizing, the human ability to look at a small number of objects and know how many there are without explicitly counting.
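A minimal sketch of the one-shot "glance" idea, assuming a pre-extracted whole-image feature vector and a simple linear read-out (a stand-in for a real learned neural architecture; the weights below are random placeholders, not trained values):

```python
import numpy as np

# Hypothetical glance-based counter: one forward pass over a whole-image
# feature vector predicts a count per object category, with no explicit
# enumeration of individual objects. A real model would learn the weights
# from images labeled with ground-truth counts.
rng = np.random.default_rng(0)

feature_dim = 512                          # size of the image feature vector
categories = ["chair", "table", "person"]  # categories to count

W = rng.normal(size=(len(categories), feature_dim)) * 0.01
b = np.zeros(len(categories))

def glance_counts(image_features):
    """Estimate per-category counts from a single glance at the image."""
    raw = W @ image_features + b
    # Counts are non-negative integers; clip and round the raw estimates.
    return np.maximum(np.rint(raw), 0).astype(int)

features = rng.normal(size=feature_dim)    # stand-in for real image features
counts = dict(zip(categories, glance_counts(features)))
print(counts)
```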
Other models Parikh’s lab is developing can estimate partial counts from parts of objects and incorporate contextual information. For instance, in an office scene, if there are two desks, there are probably at least two chairs.
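The two ideas above can be illustrated with toy rules (invented for illustration, not the lab's actual learned models): parts of objects imply fractional counts, and co-occurring objects nudge each other's estimates.

```python
# Toy sketches of part-based and context-aware counting; the numbers and
# rules are invented for illustration, not learned from real data.

def count_from_parts(part_evidence, parts_per_object):
    """Estimate an object count from detected parts,
    e.g. 8 chair legs suggest about 2 chairs."""
    return round(part_evidence / parts_per_object)

def contextual_count(raw_count, context_count):
    """Adjust a raw chair estimate with a desks-imply-chairs prior:
    an office with N desks probably has at least N chairs."""
    return max(raw_count, context_count)

print(count_from_parts(part_evidence=8, parts_per_object=4))  # 2
print(contextual_count(raw_count=1, context_count=2))         # 2
```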
Both projects are broadly applicable. Notably, both could be used to aid visually impaired users. Parikh suggested the latter project could help in automatic inventory management, as well.
“For example, it’s of interest to Amazon in particular because they have these warehouses and, for a while at least, they just had people walking around and counting how many boxes there are and what each type is,” she said. “If you can have an automatic agent that can just count, it can save a lot of money for these companies.”
Batra and Parikh run labs that collaborate on many projects, and they plan to work together on the intersection of these two in particular.