SOTAVerified

Image Comprehension

Papers

Showing 1-25 of 49 papers

Title | Status | Hype
CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs | - | 0
RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models | - | 0
RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving | - | 0
CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation | - | 0
New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Code | 1
SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models | - | 0
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models | - | 0
RRHF-V: Ranking Responses to Mitigate Hallucinations in Multimodal Large Language Models with Human Feedback | Code | 0
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM | - | 0
RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Code | 1
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation | Code | 2
Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges | - | 0
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective | Code | 2
CLIC: Contrastive Learning Framework for Unsupervised Image Complexity Representation | Code | 0
MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction for Multimodal Retrieval | Code | 0
Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension | - | 0
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | Code | 2
Teach Multimodal LLMs to Comprehend Electrocardiographic Images | - | 0
FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion | Code | 0
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code | 1
FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs | - | 0
IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web | - | 0
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models | Code | 2
Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | - | 0
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | Code | 1

No leaderboard results yet.