IC3: Image Captioning by Committee Consensus

David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, John Canny
University of California, Berkeley; Google AI
The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In the IC3 (Image Captioning by Committee Consensus) method, we first leverage standard image captioning models to generate descriptions covering a range of content within the image, similar to how human raters describe the image from independent and unique points of view. We then summarize the group of captions using a vision-free summarization model into a single, high-quality description of the image, suitable for use in visual description applications.
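The two stages described above (sample several candidate captions, then summarize them without access to the image) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_caption` and `summarize` are hypothetical callables standing in for a real captioning model and a real text-only summarization model, the prompt wording in `build_summary_prompt` is invented for illustration, and the default of five candidates is arbitrary.

```python
import random
from typing import Any, Callable, List


def build_summary_prompt(captions: List[str]) -> str:
    """Format candidate captions into a text-only prompt.

    The wording here is hypothetical; the source does not specify a prompt.
    """
    bullet_list = "\n".join(f"- {c}" for c in captions)
    return (
        "These captions describe the same image from different viewpoints:\n"
        f"{bullet_list}\n"
        "Write one caption that combines their details."
    )


def ic3_caption(
    image: Any,
    sample_caption: Callable[[Any, random.Random], str],  # hypothetical captioner
    summarize: Callable[[str], str],  # hypothetical vision-free summarizer
    num_candidates: int = 5,  # arbitrary choice for this sketch
    seed: int = 0,
) -> str:
    """Sample several captions for one image, then summarize them as text."""
    rng = random.Random(seed)
    # Stage 1: draw several candidate captions from the captioning model.
    candidates = [sample_caption(image, rng).strip() for _ in range(num_candidates)]
    # Drop empty strings and exact duplicates while preserving order.
    unique = list(dict.fromkeys(c for c in candidates if c))
    # Stage 2: the summarizer receives only text, never the image.
    return summarize(build_summary_prompt(unique))


# Toy stand-ins so the sketch runs without any model.
def toy_captioner(image: Any, rng: random.Random) -> str:
    return rng.choice(image["possible_captions"])


def toy_summarizer(prompt: str) -> str:
    # Joins the bulleted captions; a real summarizer would rewrite them.
    captions = [line[2:] for line in prompt.splitlines() if line.startswith("- ")]
    return "; ".join(captions)


if __name__ == "__main__":
    image = {
        "possible_captions": [
            "A dog runs across a grassy park.",
            "A brown dog chases a red ball.",
            "People sit on benches in the background.",
        ]
    }
    print(ic3_caption(image, toy_captioner, toy_summarizer))
```

Passing the captioner and summarizer in as plain callables keeps the sketch independent of any particular model library; in practice each would wrap whichever captioning and language models are available.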

Abstract

If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to approximate the reference distribution of image captions; however, doing so encourages captions that are viewpoint-impoverished. Such captions often focus on only a subset of the possible details, while ignoring potentially useful information in the scene. In this work, we introduce a simple, yet novel, method: "Image Captioning by Committee Consensus" (IC3), designed to generate a single caption that captures high-level details from several viewpoints. Notably, humans rate captions produced by IC3 at least as helpful as those from baseline SOTA models more than two-thirds of the time, and IC3 captions can improve the performance of SOTA automated recall systems by up to 84%, indicating significant material improvements over existing SOTA approaches for visual description.

BibTeX

@inproceedings{chan-etal-2023-ic3,
    title = "IC3: Image Captioning by Committee Consensus",
    author = "Chan, David M and
        Myers, Austin and
        Vijayanarasimhan, Sudheendra and
        Ross, David A and
        Canny, John",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Association for Computational Linguistics",
}