| Patch-level Sounding Object Tracking for Audio-Visual Question Answering | Dec 14, 2024 | Audio-visual Question Answering, Object Tracking | Unverified | 0 |
| VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation | Dec 13, 2024 | Instruction Following, Question Answering | Unverified | 0 |
| ViUniT: Visual Unit Tests for More Robust Visual Programming | Dec 12, 2024 | Image Generation, Image-text matching | Unverified | 0 |
| Discrete Subgraph Sampling for Interpretable Graph based Visual Question Answering | Dec 11, 2024 | Explainable artificial intelligence, Explainable Artificial Intelligence (XAI) | Code Available | 0 |
| Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions | Dec 11, 2024 | Benchmarking, Question Answering | Code Available | 0 |
| Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses | Dec 11, 2024 | Image-text Retrieval, Question Answering | Unverified | 0 |
| How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | Dec 11, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| A Multimodal Social Agent | Dec 11, 2024 | Common Sense Reasoning, Decision Making | Unverified | 0 |
| Can We Generate Visual Programs Without Prompting LLMs? | Dec 11, 2024 | Data Augmentation, Question Answering | Unverified | 0 |
| MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Dec 10, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | Dec 9, 2024 | Image Generation, Language Modeling | Unverified | 0 |
| Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels | Dec 9, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering | Dec 9, 2024 | Knowledge Distillation, Question Answering | Code Available | 0 |
| Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora | Dec 6, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | Unverified | 0 |
| EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation | Dec 6, 2024 | MME, Question Answering | Unverified | 0 |
| T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts | Dec 5, 2024 | Benchmarking, Image Generation | Unverified | 0 |
| Copy-Move Forgery Detection and Question Answering for Remote Sensing Image | Dec 3, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey | Dec 3, 2024 | Cross-Modal Retrieval, Natural Language Understanding | Unverified | 0 |
| CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs | Dec 3, 2024 | Image Captioning, Quantization | Unverified | 0 |
| Understanding the World's Museums through Vision-Language Reasoning | Dec 2, 2024 | Benchmarking, Question Answering | Code Available | 0 |
| DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness | Nov 29, 2024 | Optical Character Recognition (OCR), Question Answering | Code Available | 0 |
| SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks | Nov 29, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs | Nov 28, 2024 | Attribute, Hallucination | Unverified | 0 |
| Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | Nov 28, 2024 | Image Captioning, image-classification | Unverified | 0 |