| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Mar 27, 2024 | Image Classification, Image Comprehension | Code Available | 7 | 5 |
| JourneyDB: A Benchmark for Generative Image Understanding | Jul 3, 2023 | Image Captioning, Image Comprehension | Code Available | 2 | 5 |
| MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective | Nov 21, 2024 | Image Comprehension, Image Generation | Code Available | 2 | 5 |
| Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation | Dec 5, 2024 | Image Comprehension, Representation Learning | Code Available | 2 | 5 |
| EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | Jan 30, 2024 | Image Comprehension, Instruction Following | Code Available | 2 | 5 |
| MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models | Aug 5, 2024 | Image Comprehension, Multiple-choice | Code Available | 2 | 5 |
| Enhancing Large Vision Language Models with Self-Training on Image Comprehension | May 30, 2024 | Image Comprehension, Visual Question Answering | Code Available | 2 | 5 |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | May 24, 2024 | Hallucination, Image Comprehension | Code Available | 2 | 5 |
| StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | Nov 6, 2024 | Image Comprehension, Streaming video understanding | Code Available | 2 | 5 |
| Hierarchical Open-vocabulary Universal Image Segmentation | Jul 3, 2023 | Image Comprehension, Image Segmentation | Code Available | 2 | 5 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Sep 23, 2024 | Image Comprehension, Referring Expression | Code Available | 1 | 5 |
| RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension | Aug 3, 2023 | Image Comprehension | Code Available | 1 | 5 |
| New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Feb 27, 2025 | Image Comprehension, Referring Expression | Code Available | 1 | 5 |
| ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter | May 12, 2023 | Image Comprehension, Language Modelling | Code Available | 1 | 5 |
| Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | Jul 31, 2024 | Hallucination, Image Comprehension | Code Available | 1 | 5 |
| RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Dec 7, 2024 | Change Detection, Image Comprehension | Code Available | 1 | 5 |
| MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Apr 7, 2024 | Image Comprehension, Math | Code Available | 0 | 5 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | Jan 5, 2024 | Image Comprehension, Image to text | Code Available | 0 | 5 |
| MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction for Multimodal Retrieval | Nov 13, 2024 | Image Comprehension, Information Retrieval | Code Available | 0 | 5 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | Articles, Image Comprehension | Code Available | 0 | 5 |
| VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning | Jun 20, 2024 | Image Comprehension, Question Answering | Code Available | 0 | 5 |
| RRHF-V: Ranking Responses to Mitigate Hallucinations in Multimodal Large Language Models with Human Feedback | Jan 1, 2025 | Hallucination, Image Comprehension | Code Available | 0 | 5 |
| FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion | Oct 16, 2024 | Articles, Image Comprehension | Code Available | 0 | 5 |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | Sep 26, 2023 | Articles, Image Comprehension | Code Available | 0 | 5 |
| CLIC: Contrastive Learning Framework for Unsupervised Image Complexity Representation | Nov 19, 2024 | Attribute, Contrastive Learning | Code Available | 0 | 5 |
| Multiplane Prior Guided Few-Shot Aerial Scene Rendering | Jun 7, 2024 | Image Comprehension, NeRF | Unverified | 0 | 0 |
| An End-to-End OCR Text Re-organization Sequence Learning for Rich-text Detail Image Comprehension | Aug 1, 2020 | Decoder, global-optimization | Unverified | 0 | 0 |
| Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension | Nov 9, 2024 | Image Comprehension, Language Modeling | Unverified | 0 | 0 |
| GeoLocator: a location-integrated large multimodal model for inferring geo-privacy | Nov 21, 2023 | Image Comprehension | Unverified | 0 | 0 |
| CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation | Mar 7, 2025 | Image Comprehension, Memorization | Unverified | 0 | 0 |
| CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs | May 30, 2025 | Diagnostic, Image Comprehension | Unverified | 0 | 0 |
| EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM | Dec 12, 2024 | Image Comprehension, Image Generation | Unverified | 0 | 0 |
| FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs | Sep 20, 2024 | Image Captioning, Image Comprehension | Unverified | 0 | 0 |
| Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine | Jan 16, 2024 | Diagnostic, Image Comprehension | Unverified | 0 | 0 |
| IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web | Sep 14, 2024 | Image Comprehension | Unverified | 0 | 0 |
| Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models | Jan 10, 2025 | Form, Image Comprehension | Unverified | 0 | 0 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | Jan 29, 2024 | Benchmarking, Image Comprehension | Unverified | 0 | 0 |
| Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | Aug 1, 2024 | Hallucination, Image Comprehension | Unverified | 0 | 0 |
| On the Performance of Multimodal Language Models | Oct 4, 2023 | Benchmarking, Binary Classification | Unverified | 0 | 0 |
| RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving | Mar 18, 2025 | Autonomous Driving, Decision Making | Unverified | 0 | 0 |
| Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models | Feb 13, 2024 | Image Comprehension, Multimodal Recommendation | Unverified | 0 | 0 |
| RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models | Mar 25, 2025 | Image Comprehension, Visual Reasoning | Unverified | 0 | 0 |
| SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models | Feb 18, 2025 | Image Comprehension, Question Answering | Unverified | 0 | 0 |
| SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition | Jan 18, 2024 | Audio-Visual Speech Recognition, Automatic Speech Recognition | Unverified | 0 | 0 |
| Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges | Dec 4, 2024 | Code Generation, Image Comprehension | Unverified | 0 | 0 |
| Teach Multimodal LLMs to Comprehend Electrocardiographic Images | Oct 21, 2024 | Diagnostic, Image Comprehension | Unverified | 0 | 0 |
| Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens | Sep 15, 2023 | Image Comprehension, Language Modeling | Unverified | 0 | 0 |
| Unveiling Glitches: A Deep Dive into Image Encoding Bugs within CLIP | Jun 30, 2024 | Hallucination, Image Comprehension | Unverified | 0 | 0 |
| What Large Language Models Bring to Text-rich VQA? | Nov 13, 2023 | Image Comprehension, Optical Character Recognition (OCR) | Unverified | 0 | 0 |