Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment

2025-04-10Unverified0· sign in to hype

Jiayang Sun, Hongbo Wang, Jie Cao, Huaibo Huang, Ran He

Unverified — Be the first to reproduce this paper.

Abstract

While diffusion models excel at generating high-quality images, they often struggle with accurate counting, attributes, and spatial relationships in complex multi-object scenes. One potential approach is to utilize Multimodal Large Language Model (MLLM) as an AI agent to build a self-correction framework. However, these approaches are highly dependent on the capabilities of the employed MLLM, often failing to account for all objects within the image. To address these challenges, we propose Marmot, a novel and generalizable framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting, enhancing image-text alignment and facilitating more coherent multi-object image editing. Our framework adopts a divide-and-conquer strategy, decomposing the self-correction task into object-level subtasks according to three critical dimensions: counting, attributes, and spatial relationships. We construct a multi-agent self-correcting system featuring a decision-execution-verification mechanism, effectively mitigating inter-object interference and enhancing editing reliability. To resolve the problem of subtask integration, we propose a Pixel-Domain Stitching Smoother that employs mask-guided two-stage latent space optimization. This innovation enables parallel processing of subtask results, thereby enhancing runtime efficiency while eliminating multi-stage distortion accumulation. Extensive experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.

Tasks

AI Agent Attribute Image Generation Large Language Model Multimodal Large Language Model Object Object Counting

Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment

Abstract

Tasks

Reproductions