Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin
Code: github.com/kahnchana/locvlm
Abstract
The integration of Large Language Models (LLMs) into visual-domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance on vision-language tasks, particularly visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives can inject spatial awareness into V-LLMs. We identify optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
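To make the idea of coordinate-based instruction fine-tuning concrete, the sketch below shows one plausible way a bounding box could be serialized into normalized text and wrapped in an instruction-answer pair. This is a minimal illustration only: the function names, prompt template, and normalization scheme are assumptions for exposition, not the paper's actual coordinate representation or data format.

```python
# A minimal sketch (not the paper's exact format) of serializing
# image-space coordinates as text for instruction fine-tuning.

def box_to_text(box, width, height, precision=2):
    """Serialize a pixel-space box (x1, y1, x2, y2) as normalized text.

    Normalizing to [0, 1] keeps the string resolution-independent, so the
    same coordinate tokens are reused across differently sized images.
    """
    x1, y1, x2, y2 = box
    coords = (x1 / width, y1 / height, x2 / width, y2 / height)
    return "[" + ", ".join(f"{c:.{precision}f}" for c in coords) + "]"


def make_localization_sample(object_name, box, width, height):
    """Build one (instruction, answer) pair for coordinate-aware tuning.

    The prompt wording here is a hypothetical template, chosen only to
    illustrate how localization supervision can be phrased as VQA.
    """
    return {
        "instruction": f"Where is the {object_name} in the image?",
        "answer": f"The {object_name} is at {box_to_text(box, width, height)}.",
    }


# Example: a 'dog' box on a 640x480 image.
sample = make_localization_sample("dog", (64, 120, 320, 456), 640, 480)
print(sample["answer"])
# The dog is at [0.10, 0.25, 0.50, 0.95].
```

Under this kind of scheme, pseudo-data can be generated at scale from any detection-style annotations, since each (object, box) pair yields a ready-made question-answer sample.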
Benchmark Results
| Dataset | Model | Accuracy (%) |
|---|---|---|
| ActivityNet-QA | LocVLM-Vid-B+ | 38.2 |
| ActivityNet-QA | LocVLM-Vid-B | 37.4 |
| MSR-VTT | LocVLM-Vid-B | 51.2 |
| MSVD-QA | LocVLM-Vid-B | 66.1 |
| TGIF-QA | LocVLM-Vid-B | 51.8 |