Today, Google’s Open Images corpus for computer vision tasks got a boost with new visual relationships, human action annotations, and image-level labels, as well as a new form of multimodal annotation called localized narratives. Google says this last addition could create “potential avenues of research” for studying how people describe images, which could in turn yield interface design insights (and subsequent improvements) across web, desktop, and mobile apps.

In 2016, Google introduced Open Images, a data set of millions of labeled images spanning thousands of object categories. Major updates arrived in 2018 and 2019, bringing with them 15.4 million bounding boxes for 600 object categories and segmentation masks (which mark the outline of objects) for 2.8 million object instances in 350 categories.

“Along with the data set itself, the associated Open Images challenges have spurred the latest advances in object detection, instance segmentation, and visual relationship detection,” wrote Jordi Pont-Tuset, a research scientist at Google Research. “Open Images is the largest annotated image data set in many regards, for use in training the latest deep convolutional neural networks for computer vision tasks.”

As Pont-Tuset explains, one of the motivations behind localized narratives is to leverage the connection between vision and language, which is typically done via image captioning (i.e., images paired with written descriptions of their content). But image captioning lacks visual “grounding”: the caption does not indicate which part of the image each word refers to. To mitigate this, some researchers have drawn bounding boxes for the nouns in captions after the fact. In localized narratives, by contrast, every word in the description is grounded.

The localized narratives in Open Images were generated by annotators who provided spoken descriptions of images while hovering over regions they were describing with a computer mouse. The annotators manually transcribed their description, after which Google researchers aligned it with automatic speech transcriptions, ensuring that the speech, text, and mouse trace were correct and synchronized.
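To make the idea of word-level grounding concrete, here is a minimal Python sketch of how timed words and a timed mouse trace could be combined once they are synchronized. It is an illustration under stated assumptions, not Google's pipeline: the `TimedWord` and `TracePoint` types, their field names, and the `ground_words` helper are all hypothetical, and the bounding-box summary is just one simple way to represent the region a word covers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical record types; the field names are assumptions for illustration.
@dataclass
class TimedWord:
    text: str
    start: float  # seconds from the start of the recording
    end: float

@dataclass
class TracePoint:
    x: float  # normalized image coordinates in [0, 1]
    y: float
    t: float  # seconds from the start of the recording

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def ground_words(words: List[TimedWord],
                 trace: List[TracePoint]) -> List[Tuple[str, Optional[Box]]]:
    """Pair each word with the box around the mouse points recorded while it was spoken."""
    grounded = []
    for word in words:
        # Keep only the trace points whose timestamp falls inside the word's interval.
        points = [p for p in trace if word.start <= p.t <= word.end]
        if points:
            xs = [p.x for p in points]
            ys = [p.y for p in points]
            box: Optional[Box] = (min(xs), min(ys), max(xs), max(ys))
        else:
            box = None  # no mouse movement was recorded during this word
        grounded.append((word.text, box))
    return grounded

if __name__ == "__main__":
    words = [TimedWord("dog", 0.0, 1.0), TimedWord("grass", 1.5, 2.0)]
    trace = [TracePoint(0.1, 0.2, 0.2), TracePoint(0.3, 0.4, 0.8), TracePoint(0.5, 0.9, 3.0)]
    for text, box in ground_words(words, trace):
        print(text, box)
```

Because every word carries a time interval, each one can be tied to an image region rather than only the nouns, which is the contrast with after-the-fact bounding boxes drawn for captions.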

“Speaking and pointing simultaneously are very intuitive, which allowed us to give the annotators very vague instructions about the task,” explained Pont-Tuset. “[This latest version of] Open Images is a significant qualitative and quantitative step toward improving the unified annotations for image classification, object detection, visual relationship detection, and instance segmentation … [we] hope that [it] will further stimulate progress toward genuine scene understanding.”

Open Images is freely available. According to Google, Open Images now has 9 million images annotated with 36 million image-level labels, 15.8 million bounding boxes, 2.8 million instance segmentations, 391,000 visual relationships, and 59.9 million human-verified image-level labels in nearly 20,000 categories.