| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | Oct 17, 2024 | Visual Question Answering | Code Available | 11 |
| H2OVL-Mississippi Vision Language Models Technical Report | Oct 17, 2024 | Document AI, Visual Question Answering | Unverified | 0 |
| γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models | Oct 17, 2024 | Visual Question Answering | Unverified | 0 |
| WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines | Oct 16, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models | Oct 16, 2024 | Visual Question Answering | Unverified | 0 |
| VividMed: Vision Language Model with Versatile Visual Grounding for Medicine | Oct 16, 2024 | Language Modeling | Code Available | 1 |
| OMCAT: Omni Context Aware Transformer | Oct 15, 2024 | Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Oct 15, 2024 | Image Description, Multiple-choice | Code Available | 0 |
| MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding | Oct 15, 2024 | Visual Question Answering | Code Available | 2 |
| MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | Oct 14, 2024 | Denoising, Image Generation | Unverified | 0 |
| Towards Foundation Models for 3D Vision: How Close Are We? | Oct 14, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| Eliminating the Language Bias for Visual Question Answering with fine-grained Causal Intervention | Oct 14, 2024 | Contrastive Learning, counterfactual | Unverified | 0 |
| Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models | Oct 13, 2024 | Instruction Following, Question Answering | Unverified | 0 |
| MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models | Oct 13, 2024 | Cross-Modal Retrieval, Question Answering | Unverified | 0 |
| Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets | Oct 12, 2024 | Knowledge Distillation, Question Answering | Code Available | 0 |
| Zero-shot Commonsense Reasoning over Machine Imagination | Oct 12, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Skipping Computations in Multimodal LLMs | Oct 12, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | Oct 12, 2024 | Diversity, Hallucination | Unverified | 0 |
| Baichuan-Omni Technical Report | Oct 11, 2024 | Language Modeling | Code Available | 3 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Oct 11, 2024 | MME, Question Answering | Code Available | 1 |
| ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation | Oct 11, 2024 | Diagnostic, Language Modeling | Unverified | 0 |
| VoxelPrompt: A Vision-Language Agent for Grounded Medical Image Analysis | Oct 10, 2024 | Medical Image Analysis, Question Answering | Code Available | 2 |
| Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training | Oct 10, 2024 | Mixture-of-Experts, Visual Question Answering | Unverified | 0 |
| Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision | Oct 10, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models | Oct 9, 2024 | Question Answering, Retrieval | Unverified | 0 |