| MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression | Feb 17, 2025 | Diagnostic, Question Answering | Code Available | 1 |
| PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? | Feb 6, 2025 | Question Answering, Referring Expression | Code Available | 1 |
| Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models | Feb 3, 2025 | Adversarial Robustness, Image Captioning | Code Available | 1 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Jan 6, 2025 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image Captioning, Image to text | Code Available | 1 |
| Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering | Jan 1, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization | Dec 19, 2024 | Contrastive Learning, Decision Making | Code Available | 1 |
| MedCoT: Medical Chain of Thought via Hierarchical Expert | Dec 18, 2024 | Diagnostic, Medical Visual Question Answering | Code Available | 1 |
| MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants | Dec 17, 2024 | Image Captioning, Question Answering | Code Available | 1 |
| IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Dec 10, 2024 | Cross-Modal Retrieval, Image Classification | Code Available | 1 |
| ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | Dec 9, 2024 | Graph Generation, Scene Graph Generation | Code Available | 1 |
| RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Dec 7, 2024 | Change Detection, Image Comprehension | Code Available | 1 |
| MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | Dec 6, 2024 | Multimodal Reasoning, Visual Question Answering | Code Available | 1 |
| A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | Dec 4, 2024 | Visual Question Answering | Code Available | 1 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model, Video Understanding | Code Available | 1 |
| Cross-modal Information Flow in Multimodal Large Language Models | Nov 27, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| Teaching VLMs to Localize Specific Objects from In-context Examples | Nov 20, 2024 | Object, Object Tracking | Code Available | 1 |
| A Survey of Medical Vision-and-Language Applications and Their Techniques | Nov 19, 2024 | Decision Making, Diagnostic | Code Available | 1 |
| BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation | Nov 17, 2024 | Action Recognition, backdoor defense | Code Available | 1 |
| Nearest Neighbor Normalization Improves Multimodal Retrieval | Oct 31, 2024 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | Oct 31, 2024 | Change Detection, Question Answering | Code Available | 1 |
| ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning | Oct 23, 2024 | Image Captioning, Instruction Following | Code Available | 1 |
| Progressive Compositionality In Text-to-Image Generative Models | Oct 22, 2024 | Attribute, Contrastive Learning | Code Available | 1 |
| MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Oct 18, 2024 | Benchmarking, Question Answering | Code Available | 1 |
| WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines | Oct 16, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| VividMed: Vision Language Model with Versatile Visual Grounding for Medicine | Oct 16, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Towards Foundation Models for 3D Vision: How Close Are We? | Oct 14, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| Skipping Computations in Multimodal LLMs | Oct 12, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Oct 11, 2024 | MME, Question Answering | Code Available | 1 |
| ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models | Oct 7, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration | Oct 6, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 1 |
| A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Oct 1, 2024 | Common Sense Reasoning, DeepFake Detection | Code Available | 1 |
| T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition | Sep 29, 2024 | In-Context Learning, Question Answering | Code Available | 1 |
| Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE | Sep 26, 2024 | image-classification, Image Classification | Code Available | 1 |
| MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models | Sep 23, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 1 |
| Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering | Sep 19, 2024 | Hallucination, Hallucination Evaluation | Code Available | 1 |
| Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs | Sep 17, 2024 | Question Answering, Token Reduction | Code Available | 1 |
| LIME: Less Is More for MLLM Evaluation | Sep 10, 2024 | Image Captioning, Question Answering | Code Available | 1 |
| M3-Jepa: Multimodal Alignment via Multi-directional MoE based on the JEPA framework | Sep 9, 2024 | Computational Efficiency, Cross-Modal Retrieval | Code Available | 1 |
| V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard? | Aug 20, 2024 | Few-Shot Learning, In-Context Learning | Code Available | 1 |
| Visual Agents as Fast and Slow Thinkers | Aug 16, 2024 | Question Answering, Reasoning Segmentation | Code Available | 1 |
| Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery | Aug 9, 2024 | Contrastive Learning, Medical Visual Question Answering | Code Available | 1 |
| Boosting Audio Visual Question Answering via Key Semantic-Aware Cues | Jul 30, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | Jul 23, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Learning Trimodal Relation for AVQA with Missing Modality | Jul 23, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Jul 22, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark | Jul 18, 2024 | GPU, Image Retrieval | Code Available | 1 |
| CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Jul 1, 2024 | Image-text Retrieval, Question Answering | Code Available | 1 |
| MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment | Jun 28, 2024 | Answer Generation, Image Captioning | Code Available | 1 |
| STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering | Jun 28, 2024 | Medical Diagnosis, Medical Question Answering | Code Available | 1 |