CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

2024-01-05

Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan YAO, Mingkai Chen, Jiebo Luo


Abstract

In the pursuit of Artificial General Intelligence (AGI), a critical task for models is interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason about and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach based on multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about the multi-image inputs based on the identified similarities and differences. Our experimental results show that CoCoT improves the multi-image comprehension capabilities of large multimodal models.
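
The two-stage idea in the abstract (first compare the images, then answer using that comparison) can be illustrated with a small prompt builder. This is a minimal sketch only: the function name `build_cocot_prompt` and the prompt wording are hypothetical, and the exact prompt used by the authors is not given on this page.

```python
def build_cocot_prompt(question: str, num_images: int = 2) -> str:
    """Assemble a contrastive chain-of-thought style text prompt.

    Hypothetical template: the wording below is an illustration of the
    compare-then-answer idea, not the authors' actual prompt.
    """
    if num_images < 2:
        raise ValueError("contrastive prompting needs at least two images")

    # Refer to the images by position so the model can contrast them.
    image_refs = ", ".join(f"image {i + 1}" for i in range(num_images))

    return (
        f"You are given {num_images} images ({image_refs}).\n"
        "Step 1: Describe the similarities and differences among the images.\n"
        "Step 2: Using the similarities and differences identified in step 1, "
        f"answer the following question: {question}"
    )


if __name__ == "__main__":
    print(build_cocot_prompt("Which image matches the caption?"))
```

The resulting string would be sent to the multimodal model together with the images; how the images are attached depends on the model's own interface and is left out here.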

Tasks

Image Comprehension, Image to text, Text Matching, Visual Reasoning

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
Winoground	MMICL + CoCoT	Text Score	64.25	—	Unverified
Winoground	GPT-4V + CoCoT	Text Score	58.5	—	Unverified
Winoground	OpenFlamingo + CoCoT	Text Score	58.25	—	Unverified
Winoground	GPT-4V	Text Score	54.5	—	Unverified
Winoground	MMICL + CCoT	Text Score	51	—	Unverified
Winoground	OpenFlamingo + DDCoT	Text Score	47.5	—	Unverified
Winoground	MMICL + DDCoT	Text Score	46.75	—	Unverified
Winoground	Gemini + DDCoT	Text Score	45	—	Unverified
Winoground	OpenFlamingo + CCoT	Text Score	42.5	—	Unverified
Winoground	Gemini + CoCoT	Text Score	40	—	Unverified
Winoground	OpenFlamingo	Text Score	39	—	Unverified
Winoground	Gemini	Text Score	30.75	—	Unverified
Winoground	Gemini + CCoT	Text Score	22.5	—	Unverified
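
For readers unfamiliar with the metric, the sketch below shows how a Winoground-style text score is commonly computed. In the Winoground benchmark, each example pairs two images with two captions, and an example counts as correct on the text side only if the matching caption is preferred for both images. The sketch assumes a numeric match score is available for every caption-image pair; how such scores or choices are obtained from a generative model is not described on this page, and the names `text_correct` and `text_score` are illustrative.

```python
from typing import Sequence, Tuple

# One example: scores for (caption0, image0), (caption1, image0),
# (caption0, image1), (caption1, image1). Higher means a better match.
Scores = Tuple[float, float, float, float]


def text_correct(scores: Scores) -> bool:
    """True if the matching caption wins for both images."""
    c0_i0, c1_i0, c0_i1, c1_i1 = scores
    return c0_i0 > c1_i0 and c1_i1 > c0_i1


def text_score(all_scores: Sequence[Scores]) -> float:
    """Percentage of examples that are text-correct."""
    if not all_scores:
        raise ValueError("need at least one example")
    correct = sum(text_correct(s) for s in all_scores)
    return 100.0 * correct / len(all_scores)


if __name__ == "__main__":
    examples = [
        (0.9, 0.1, 0.2, 0.8),  # right caption preferred for both images
        (0.9, 0.1, 0.8, 0.2),  # wrong caption preferred for image 1
    ]
    print(text_score(examples))
```

Because both images must be handled correctly, the score is stricter than plain per-image accuracy.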
