| H2OVL-Mississippi Vision Language Models Technical Report | Oct 17, 2024 | Document AI, Visual Question Answering | Unverified | 0 |
| Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts? | Oct 17, 2024 | All, Language Modeling | Code Available | 0 |
| γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models | Oct 17, 2024 | Visual Question Answering | Unverified | 0 |
| Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models | Oct 16, 2024 | Visual Question Answering | Unverified | 0 |
| WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines | Oct 16, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| VividMed: Vision Language Model with Versatile Visual Grounding for Medicine | Oct 16, 2024 | Language Modeling | Code Available | 1 |
| OMCAT: Omni Context Aware Transformer | Oct 15, 2024 | Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding | Oct 15, 2024 | Visual Question Answering | Code Available | 2 |
| Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Oct 15, 2024 | Image Description, Multiple-choice | Code Available | 0 |
| Towards Foundation Models for 3D Vision: How Close Are We? | Oct 14, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | Oct 14, 2024 | Denoising, Image Generation | Unverified | 0 |
| Eliminating the Language Bias for Visual Question Answering with fine-grained Causal Intervention | Oct 14, 2024 | Contrastive Learning, counterfactual | Unverified | 0 |
| Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models | Oct 13, 2024 | Instruction Following, Question Answering | Unverified | 0 |
| MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models | Oct 13, 2024 | Cross-Modal Retrieval, Question Answering | Unverified | 0 |
| Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets | Oct 12, 2024 | Knowledge Distillation, Question Answering | Code Available | 0 |
| Zero-shot Commonsense Reasoning over Machine Imagination | Oct 12, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | Oct 12, 2024 | Diversity, Hallucination | Unverified | 0 |
| Skipping Computations in Multimodal LLMs | Oct 12, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Oct 11, 2024 | MME, Question Answering | Code Available | 1 |
| Baichuan-Omni Technical Report | Oct 11, 2024 | Language Modeling | Code Available | 3 |
| ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation | Oct 11, 2024 | Diagnostic, Language Modeling | Unverified | 0 |
| VoxelPrompt: A Vision-Language Agent for Grounded Medical Image Analysis | Oct 10, 2024 | Medical Image Analysis, Question Answering | Code Available | 2 |
| Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision | Oct 10, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training | Oct 10, 2024 | Mixture-of-Experts, Visual Question Answering | Unverified | 0 |
| PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models | Oct 9, 2024 | Question Answering, Retrieval | Unverified | 0 |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | Oct 9, 2024 | cross-modal alignment, Visual Question Answering | Code Available | 2 |
| Large Continual Instruction Assistant | Oct 8, 2024 | Question Answering, Semantic Similarity | Code Available | 2 |
| Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning | Oct 8, 2024 | Image Retrieval, Math | Unverified | 0 |
| ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments | Oct 8, 2024 | Decoder, Question Answering | Code Available | 0 |
| Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond | Oct 8, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Core Tokensets for Data-efficient Sequential Training of Transformers | Oct 8, 2024 | Image Captioning, image-classification | Code Available | 0 |
| TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | Oct 8, 2024 | Change Detection, Earth Observation | Code Available | 2 |
| VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks | Oct 7, 2024 | Information Retrieval, Language Modeling | Unverified | 0 |
| MM-R^3: On (In-)Consistency of Multi-modal Large Language Models (MLLMs) | Oct 7, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models | Oct 7, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration | Oct 6, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 1 |
| TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions | Oct 5, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| Gamified crowd-sourcing of high-quality data for visual fine-tuning | Oct 5, 2024 | Visual Question Answering | Unverified | 0 |
| Backdooring Vision-Language Models with Out-Of-Distribution Data | Oct 2, 2024 | Image Captioning, Image to text | Unverified | 0 |
| Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities | Oct 2, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models | Oct 1, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | Oct 1, 2024 | Benchmarking, Fairness | Unverified | 0 |
| BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data | Oct 1, 2024 | Code Generation, Logical Reasoning | Code Available | 0 |
| A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Oct 1, 2024 | Common Sense Reasoning, DeepFake Detection | Code Available | 1 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | Sep 30, 2024 | Mixture-of-Experts, Optical Character Recognition (OCR) | Unverified | 0 |
| World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering | Sep 30, 2024 | Optical Character Recognition (OCR), Question Answering | Code Available | 0 |
| T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition | Sep 29, 2024 | In-Context Learning, Question Answering | Code Available | 1 |
| TrojVLM: Backdoor Attack Against Vision Language Models | Sep 28, 2024 | Backdoor Attack, Image Captioning | Unverified | 0 |
| 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models | Sep 28, 2024 | Diagnostic, Language Modeling | Unverified | 0 |
| Enhancing Explainability in Multimodal Large Language Models Using Ontological Context | Sep 27, 2024 | Image Captioning, Question Answering | Unverified | 0 |