| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Dec 12, 2024 | EgoSchema | Code Available | 3 |
| A Multimodal Social Agent | Dec 11, 2024 | Common Sense Reasoning, Decision Making | Unverified | 0 |
| Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions | Dec 11, 2024 | Benchmarking, Question Answering | Code Available | 0 |
| How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | Dec 11, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses | Dec 11, 2024 | Image-text Retrieval, Question Answering | Unverified | 0 |
| Discrete Subgraph Sampling for Interpretable Graph based Visual Question Answering | Dec 11, 2024 | Explainable artificial intelligence, Explainable Artificial Intelligence (XAI) | Code Available | 0 |
| Can We Generate Visual Programs Without Prompting LLMs? | Dec 11, 2024 | Data Augmentation, Question Answering | Unverified | 0 |
| IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Dec 10, 2024 | Cross-Modal Retrieval, Image Classification | Code Available | 1 |
| BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities | Dec 10, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 2 |
| MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Dec 10, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | Dec 9, 2024 | Graph Generation, Scene Graph Generation | Code Available | 1 |
| FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering | Dec 9, 2024 | Knowledge Distillation, Question Answering | Code Available | 0 |
| Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels | Dec 9, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | Dec 9, 2024 | Image Generation, Language Modeling | Unverified | 0 |
| TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | Dec 7, 2024 | Depth Estimation, Mathematical Reasoning | Code Available | 2 |
| RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Dec 7, 2024 | Change Detection, Image Comprehension | Code Available | 1 |
| Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora | Dec 6, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | Dec 6, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | Dec 6, 2024 | Multimodal Reasoning, Visual Question Answering | Code Available | 1 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | Unverified | 0 |
| EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation | Dec 6, 2024 | MME, Question Answering | Unverified | 0 |
| VisionZip: Longer is Better but Not Necessary in Vision Language Models | Dec 5, 2024 | Video Understanding, Visual Question Answering | Code Available | 3 |
| T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts | Dec 5, 2024 | Benchmarking, Image Generation | Unverified | 0 |
| FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | Dec 5, 2024 | Descriptive, Visual Question Answering | Code Available | 2 |
| A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | Dec 4, 2024 | Visual Question Answering | Code Available | 1 |