| FFA Sora, video generation as fundus fluorescein angiography simulator | Dec 23, 2024 | Privacy Preserving, Question Answering | Unverified | 0 |
| Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective | Dec 23, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy | Dec 23, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| Prompting Large Language Models with Rationale Heuristics for Knowledge-based Visual Question Answering | Dec 22, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization | Dec 21, 2024 | Image Captioning, Multimodal Reasoning | Code Available | 0 |
| NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization | Dec 20, 2024 | Compositional Generalization (AVG), Novel Concepts | Code Available | 0 |
| Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization | Dec 19, 2024 | Contrastive Learning, Decision Making | Code Available | 1 |
| AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Dec 19, 2024 | Autonomous Driving, Benchmarking | Code Available | 2 |
| FedPIA -- Permuting and Integrating Adapters leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning | Dec 19, 2024 | Federated Learning, parameter-efficient fine-tuning | Unverified | 0 |
| Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models | Dec 19, 2024 | Autonomous Driving, Image Captioning | Code Available | 0 |
| Consistency of Compositional Generalization across Multiple Levels | Dec 18, 2024 | Meta-Learning, Question Answering | Code Available | 0 |
| MedCoT: Medical Chain of Thought via Hierarchical Expert | Dec 18, 2024 | Diagnostic, Medical Visual Question Answering | Code Available | 1 |
| A Concept-Centric Approach to Multi-Modality Learning | Dec 18, 2024 | Image-text matching, Question Answering | Unverified | 0 |
| MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants | Dec 17, 2024 | Image Captioning, Question Answering | Code Available | 1 |
| Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues | Dec 17, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering | Dec 16, 2024 | In-Context Learning, Instruction Following | Code Available | 0 |
| CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology | Dec 16, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track | Dec 15, 2024 | Image Captioning, Medical Question Answering | Unverified | 0 |
| Damage Assessment after Natural Disasters with UAVs: Semantic Feature Extraction using Deep Learning | Dec 14, 2024 | Decision Making, Question Answering | Unverified | 0 |
| Patch-level Sounding Object Tracking for Audio-Visual Question Answering | Dec 14, 2024 | Audio-visual Question Answering, Object Tracking | Unverified | 0 |
| VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation | Dec 13, 2024 | Instruction Following, Question Answering | Unverified | 0 |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Dec 13, 2024 | Chart Understanding, Mixture-of-Experts | Code Available | 9 |
| ViUniT: Visual Unit Tests for More Robust Visual Programming | Dec 12, 2024 | Image Generation, Image-text matching | Unverified | 0 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Dec 12, 2024 | EgoSchema | Code Available | 3 |
| Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine | Dec 12, 2024 | Language Modeling, Language Modelling | Code Available | 2 |