Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

2025-04-30

Jiaxu Qian, Chendong Wang, Yifan Yang, Chaoyun Zhang, Huiqiang Jiang, Xufang Luo, Yu Kang, Qingwei Lin, Anlan Zhang, Shiqi Jiang, Ting Cao, Tianjun Mao, Suman Banerjee, Guyue Liu, Saravan Rajmohan, Dongmei Zhang, Yuqing Yang, Qi Zhang, Lili Qiu

Abstract

Recent advancements in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, with strong results in applications such as image captioning and interactive question answering. However, these models struggle to process visual data accurately, particularly in tasks requiring precise object recognition and fine visual details. Stringent token limits often result in the omission of critical information, hampering performance. To address these limitations, we introduce Zoomer, a novel visual prompting mechanism designed to enhance MLLM performance while preserving essential visual details within token limits. Zoomer features three key innovations: a prompt-aware strategy that dynamically highlights relevant image regions, a spatial-preserving orchestration schema that maintains object integrity, and a budget-aware prompting method that balances global context with crucial visual details. Comprehensive evaluations across multiple datasets demonstrate that Zoomer consistently outperforms baseline methods, achieving up to a 26.9% improvement in accuracy while significantly reducing token consumption.

Tasks

Image Captioning, Object Recognition, Question Answering, Visual Prompting
