Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models

2025-02-04Unverified0· sign in to hype

Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen

Unverified — Be the first to reproduce this paper.

Abstract

Large vision-and-language models (LVLMs) typically treat visual and textual embeddings as homogeneous inputs to a large language model (LLM). However, these inputs are inherently different: visual inputs are multi-dimensional and contextually rich, often pre-encoded by models like CLIP, while textual inputs lack this structure. In this paper, we propose Decomposed Attention (D-Attn), a novel method that processes visual and textual embeddings differently by decomposing the 1-D causal self-attention in LVLMs. After the attention decomposition, D-Attn diagonalizes visual-to-visual self-attention, reducing computation from O(|V|^2) to O(|V|) for |V| visual embeddings without compromising performance. Moreover, D-Attn debiases positional encodings in textual-to-visual cross-attention, further enhancing visual understanding. Finally, we introduce an -weighting strategy to merge visual and textual information, maximally preserving the pre-trained LLM's capabilities with minimal modifications. Extensive experiments and rigorous analyses validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks while significantly reducing computational costs. Code, data, and models will be publicly available.

Tasks

Language Modeling Language Modelling Large Language Model

Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models

Abstract

Tasks

Reproductions