Interactively Building a Discriminative Vocabulary of Nameable Attributes

 

Devi Parikh and Kristen Grauman



 


   


Abstract


Human-nameable visual attributes offer many advantages when used as mid-level features for object recognition, but existing techniques to gather relevant attributes can be inefficient (costing substantial effort or expertise) and/or insufficient (descriptive properties need not be discriminative). We introduce an approach to define a vocabulary of attributes that is both human understandable and discriminative. The system takes object/scene-labeled images as input, and returns as output a set of attributes elicited from human annotators that distinguish the categories of interest. To ensure a compact vocabulary and efficient use of annotators’ effort, we 1) show how to actively augment the vocabulary such that new attributes resolve inter-class confusions, and 2) propose a novel "nameability" manifold that prioritizes candidate attributes by their likelihood of being associated with a nameable property. We demonstrate the approach with multiple datasets, and show its clear advantages over baselines that lack a nameability model or rely on a list of expert-provided attributes. 

  


Motivation


To be most useful, attributes should be

 

Discriminative: so that they can be learnt reliably in the available feature-space, and can effectively classify the categories

 

and

 

Nameable: so that they can be used for zero-shot learning, describing previously unseen instances or unusual aspects of images, etc.

 

 

Approach                         Discriminative     Nameable
Hand-generated list              Not necessarily    Yes
Mining the web                   Not necessarily    Yes
Automatic splits of categories   Yes                No
Proposed                         Yes                Yes

 

 


Proposal


We propose an interactive approach that prompts a human-in-the-loop to provide names for attribute hypotheses it discovers. The system takes as input a set of training images with their associated category labels, as well as one or more visual feature spaces (Gist, color, etc.), and returns as output a set of attribute models that together can distinguish the categories of interest. 

 

To visualize a candidate attribute for which the system seeks a name, a human is shown images sampled along the direction normal to some separating hyperplane in the feature space. Since many hypotheses will not correspond to something humans can visually identify and succinctly describe, a naive attribute discovery process — one that simply cycles through discriminative splits and asks the annotator to either name or reject them — is impractical. 

 

Instead, we design the approach to actively minimize the number of meaningless queries presented to an annotator, so that human effort is mostly spent assigning meaning to divisions in feature space that actually have it, as opposed to discarding uninterpretable splits.

 

We accomplish this with two key ideas (a sketch of the resulting loop follows the list below). At each iteration, our approach:

1) focuses on attribute hypotheses that complement the classification power of existing attributes collected thus far, and 

2) predicts the nameability of each discriminative hypothesis and prioritizes those likely to be nameable. For this, we explore whether there exists some manifold structure in the space of nameable hyperplane separators.
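
The overall loop can be pictured with a short Python sketch. Everything below is illustrative: the four helper callables (propose_split, score_nameability, visualize, ask_annotator) are hypothetical placeholders passed in as arguments, and rough versions of the main steps are sketched in the Approach section below; this is not the authors' released code.

```python
def build_attribute_vocabulary(features, labels, budget,
                               propose_split, score_nameability,
                               visualize, ask_annotator):
    """Illustrative control flow for the interactive loop. The four callables
    are hypothetical placeholders for the steps described in the Approach
    section; this sketch only shows how they fit together."""
    attributes = []            # (name, hyperplane) pairs accepted by the annotator
    named, rejected = [], []   # feedback used to update the nameability model

    for _ in range(budget):
        # 1) propose hyperplanes that separate the classes most confused by
        #    the attributes collected so far
        candidates = propose_split(features, labels, attributes)

        # 2) rank the candidates by predicted nameability and query the best one
        best = max(candidates, key=lambda h: score_nameability(h, named, rejected))
        name = ask_annotator(visualize(best, features))

        if name is not None:               # the annotator could name the property
            attributes.append((name, best))
            named.append(best)
        else:                              # the split was not visually interpretable
            rejected.append(best)

    return attributes
```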

 

    

 


Approach


There are three main challenges to be addressed in our proposed interactive approach: 

 

Discovering attribute hypotheses: We actively discover hyperplanes in the visual feature-space that separate a subset of classes that are currently most confused. We use iterative max-margin clustering to discover such a split.
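
As a rough illustration of this step (not the exact optimization used in the paper), iterative max-margin clustering can be approximated by alternating between training a linear SVM on the current binary split of the confused classes and re-labeling points by which side of the learned hyperplane they fall on. The sketch below uses scikit-learn's LinearSVC and assumes an initial split, e.g. from k-means.

```python
import numpy as np
from sklearn.svm import LinearSVC

def max_margin_split(X, init_labels, n_iters=10):
    """Illustrative approximation of iterative max-margin clustering: alternate
    between fitting a linear SVM on the current binary split and re-labeling
    points by the side of the hyperplane they fall on. Assumes init_labels
    (e.g. from k-means on the confused classes) contains both 0s and 1s."""
    y = init_labels.copy()
    svm = LinearSVC(C=1.0)
    for _ in range(n_iters):
        svm.fit(X, y)
        y_new = (svm.decision_function(X) > 0).astype(int)
        # stop if the split stopped changing or collapsed onto one side
        if np.array_equal(y_new, y) or len(np.unique(y_new)) < 2:
            break
        y = y_new
    # the hyperplane (w, b) is the candidate attribute hypothesis
    return svm.coef_.ravel(), svm.intercept_[0], y
```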

 


 

Predicting the nameability of a hypothesis: At each iteration, we build a nameability manifold using a mixture of probabilistic principal component analyzers, fit to the user responses collected so far. The manifold is learnt in the space of hyperplane parameters. As seen below, the manifold can effectively predict the nameability of a novel discriminative hyperplane.
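
The sketch below illustrates the idea under simplifying assumptions: where the paper fits a mixture of probabilistic PCA to the parameters of hyperplanes the user has named, we substitute scikit-learn's GaussianMixture as a stand-in density model, and score a novel hyperplane by its log-likelihood under that model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_nameability_model(named_hyperplanes, n_components=2):
    """Fit a density model in the space of hyperplane parameters (w, b) of the
    splits the annotator has already named. A GaussianMixture is used here as
    a simplified stand-in for the paper's mixture of probabilistic PCA."""
    W = np.vstack([np.append(w, b) for (w, b) in named_hyperplanes])
    return GaussianMixture(n_components=min(n_components, len(W))).fit(W)

def score_nameability(model, w, b):
    """Higher log-likelihood under the model of named hyperplanes suggests the
    candidate is more likely to correspond to a nameable visual property."""
    return model.score_samples(np.append(w, b).reshape(1, -1))[0]
```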

 

 

 

Visualizing an attribute: In order to present a visualization of a hyperplane to the user, we sample images from the dataset such that their distance orthogonal to the hyperplane varies while variation along the hyperplane is minimized. The user is then asked to name a visual property that varies in the images from left to right. This name, along with the hyperplane parameters, forms our newly discovered attribute.
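
A simple heuristic sketch of this sampling step (the binning strategy and names below are our own assumptions): project each image onto the hyperplane normal, split the range of signed distances into bins, and within each bin keep the image that deviates least in the directions parallel to the hyperplane.

```python
import numpy as np

def sample_visualization_images(X, w, b, n_show=6):
    """Pick a left-to-right sequence of images whose signed distance to the
    hyperplane increases, while their variation parallel to the hyperplane
    stays small. Heuristic sketch only."""
    w = np.asarray(w, dtype=float)
    norm = np.linalg.norm(w)
    w, b = w / norm, b / norm

    dist = X @ w + b                  # signed distance orthogonal to the hyperplane
    resid = X - np.outer(X @ w, w)    # component of each feature parallel to the hyperplane
    off_axis = np.linalg.norm(resid - resid.mean(axis=0), axis=1)

    # split the range of distances into n_show bins; within each bin keep the
    # image that moves least along the hyperplane
    bins = np.array_split(np.argsort(dist), n_show)
    return [idx[np.argmin(off_axis[idx])] for idx in bins]
```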

 

 

  

 


Evaluation


We evaluate our approach on two datasets of 8 categories each: Outdoor Scene Recognition (OSR) and a subset of the Animals with Attributes (AWA) dataset. For both datasets, we use Gist and color features. 

 

In order to automatically evaluate our proposed approach, we collect nameability annotations for all 247 discriminative hyperplanes across both feature spaces of both datasets. We show a visualization of each hyperplane to 20 Amazon Mechanical Turk subjects, and ask them to indicate how obvious a change is visible in the images (on a scale of 1 to 4), and what the changing property is. Example responses are shown below:

"Black"

 

"Spotted"

 

Unnameable

 

"Green"

 

"Congested"

 

We consider a hyperplane to be nameable if the average 'obviousness' score it receives is above 3. This pool of annotated hyperplanes can then be used to conduct automatic experiments while still mimicking a real user in the loop.
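
The labeling rule reduces to a simple average-and-threshold check; a minimal sketch (the function name is ours, not from the paper):

```python
def is_nameable(obviousness_scores, threshold=3.0):
    """Label a hyperplane as nameable if the mean 'obviousness' rating
    (collected on a 1-4 scale from the MTurk subjects) exceeds the threshold."""
    return sum(obviousness_scores) / len(obviousness_scores) > threshold
```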

 


Results


Discriminative-only baseline: Compared to a baseline that presents discriminative hyperplanes to the user without any nameability modeling (see below), our approach discovers more named attributes for the same user effort, which also leads to better recognition performance. 

 

 

  

 

Descriptive-only baseline: On the other hand, compared to purely descriptive attributes, our approach finds more discriminative attributes, which also leads to improved recognition performance (see below).

 

 

 

Automatically generated descriptions: Our discovered attributes can be used to describe both previously seen and previously unseen (e.g., zebra) images, as shown below.
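
A minimal sketch of how such a description could be produced from the discovered attributes (the output template is an assumption, not the paper's exact format):

```python
import numpy as np

def describe_image(x, attributes):
    """Describe an image feature vector x using discovered attributes, given
    as (name, w, b) tuples; the sentence template below is illustrative only."""
    present = [name for name, w, b in attributes if np.dot(w, x) + b > 0]
    absent = [name for name, w, b in attributes if np.dot(w, x) + b <= 0]
    return "has: " + ", ".join(present) + " | lacks: " + ", ".join(absent)
```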

 

 

 


Publications


D. Parikh and K. Grauman

Interactively Building a Discriminative Vocabulary of Nameable Attributes

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011

[supplementary material] [poster] [slides]

 

D. Parikh and K. Grauman

Interactive Discovery of Task-Specific Nameable Attributes (Abstract)

First Workshop on Fine-Grained Visual Categorization (FGVC)

held in conjunction with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011 (Best Poster Award)

[poster]

 

 

[Thanks to Yong Jae Lee for the webpage template]