MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

2026-03-19Code Available0· sign in to hype

Fan Yang, Xingping Dong, Xin Yu, Wenhan Luo, Wei Liu, Kaihao Zhang

Code Available — Be the first to reproduce this paper.

Code

github.com/yf0412/mrd
OfficialIn paper★ 2

Abstract

Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches leverage vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, improving understanding capacity of MLLMs. However, this paradigm often leads to object fragmentation, resulting in semantic bias and incomplete retrieval, while also introducing false positives from irrelevant background patches. To address these issues, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework that enhances HR image understanding from both local and global perspectives. Locally, MRD enforces cross-scale semantic consistency via multi-resolution semantic fusion to mitigate single-resolution bias and alleviate object fragmentation. Globally, it integrates open-vocabulary object detection (OVD) as localization priors within a unified framework. Extensive experiments across multiple MLLMs on HR image benchmarks demonstrate that MRD achieves state-of-the-art (SOTA) performance on both single-object and multi-object understanding tasks. Code will be available at: https://github.com/yf0412/MRD.

MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

Code

Abstract

Reproductions