SOTAVerified

Image Comprehension

Papers

Showing 1–25 of 49 papers

| Title | Status | Hype |
| --- | --- | --- |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Code | 7 |
| JourneyDB: A Benchmark for Generative Image Understanding | Code | 2 |
| MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective | Code | 2 |
| Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation | Code | 2 |
| EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | Code | 2 |
| MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models | Code | 2 |
| Enhancing Large Vision Language Models with Self-Training on Image Comprehension | Code | 2 |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | Code | 2 |
| StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | Code | 2 |
| Hierarchical Open-vocabulary Universal Image Segmentation | Code | 2 |
| FineCops-Ref: A New Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code | 1 |
| RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension | Code | 1 |
| New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Code | 1 |
| ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter | Code | 1 |
| Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | Code | 1 |
| RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Code | 1 |
| MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Code | 0 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | Code | 0 |
| MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction for Multimodal Retrieval | Code | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Code | 0 |
| VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning | Code | 0 |
| RRHF-V: Ranking Responses to Mitigate Hallucinations in Multimodal Large Language Models with Human Feedback | Code | 0 |
| FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion | Code | 0 |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | Code | 0 |
| CLIC: Contrastive Learning Framework for Unsupervised Image Complexity Representation | Code | 0 |
Page 1 of 2

No leaderboard results yet.