| Language-Image Models with 3D Understanding | May 6, 2024 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Language Is Not All You Need: Aligning Perception with Language Models | Feb 27, 2023 | AllImage Captioning | —Unverified | 0 |
| Large Scale Scene Text Verification with Guided Attention | Apr 23, 2018 | Question AnsweringScene Text Detection | —Unverified | 0 |
| Large Vision-Language Models for Remote Sensing Visual Question Answering | Nov 16, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Latent Variable Models for Visual Question Answering | Jan 16, 2021 | BenchmarkingQuestion Answering | —Unverified | 0 |
| LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement | Nov 20, 2024 | Autonomous DrivingComputational Efficiency | —Unverified | 0 |
| LAVIS: A Library for Language-Vision Intelligence | Sep 15, 2022 | BenchmarkingImage Captioning | —Unverified | 0 |
| LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering | Jan 29, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Learning Answer Embeddings for Visual Question Answering | Jun 10, 2018 | Question AnsweringTransfer Learning | —Unverified | 0 |
| Learning by Asking Questions | Dec 4, 2017 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | Oct 24, 2022 | cross-modal alignmentCross-Modal Retrieval | —Unverified | 0 |
| Learning Compositional Representation for Few-shot Visual Question Answering | Feb 21, 2021 | AttributeQuestion Answering | —Unverified | 0 |
| Learning How To Ask: Cycle-Consistency Refines Prompts in Multimodal Foundation Models | Feb 13, 2024 | Code GenerationHumanEval | —Unverified | 0 |
| Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering | Apr 16, 2016 | General ClassificationHuman-Object Interaction Detection | —Unverified | 0 |
| Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues | Mar 1, 2021 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Learning Rich Image Region Representation for Visual Question Answering | Oct 29, 2019 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Learning Sparse Mixture of Experts for Visual Question Answering | Sep 19, 2019 | Mixture-of-ExpertsQuestion Answering | —Unverified | 0 |
| Learning Sparsity for Effective and Efficient Music Performance Question Answering | Jun 2, 2025 | Audio-visual Question AnsweringQuestion Answering | —Unverified | 0 |
| Learning to Compose Diversified Prompts for Image Emotion Classification | Jan 26, 2022 | ClassificationEmotion Classification | —Unverified | 0 |
| Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering | Sep 11, 2024 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Learning to Disambiguate by Asking Discriminative Questions | Aug 9, 2017 | BenchmarkingImage Captioning | —Unverified | 0 |
| Neural Reasoning, Fast and Slow, for Video Question Answering | Jul 10, 2019 | Natural QuestionsQuestion Answering | —Unverified | 0 |
| Learning to Recognize the Unseen Visual Predicates | Sep 25, 2019 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Learning to Select Question-Relevant Relations for Visual Question Answering | Jun 1, 2021 | Graph AttentionQuestion Answering | —Unverified | 0 |
| Learning to Specialize with Knowledge Distillation for Visual Question Answering | Dec 1, 2018 | General ClassificationGeneral Knowledge | —Unverified | 0 |
| Learning Visual Knowledge Memory Networks for Visual Question Answering | Jun 13, 2018 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision | Apr 20, 2020 | counterfactualimage-classification | —Unverified | 0 |
| Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models | Nov 23, 2023 | Language ModellingLarge Language Model | —Unverified | 0 |
| LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? | Mar 25, 2025 | Autonomous NavigationQuestion Answering | —Unverified | 0 |
| Less Is More: Linear Layers on CLIP Features as Powerful VizWiz Model | Jun 10, 2022 | Question AnsweringTask 2 | —Unverified | 0 |
| Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation | Jul 18, 2023 | Image GenerationQuestion Answering | —Unverified | 0 |
| Leveraging Medical Visual Question Answering with Supporting Facts | May 28, 2019 | DiversityMedical Visual Question Answering | —Unverified | 0 |
| Leveraging Visual Question Answering for Image-Caption Ranking | May 4, 2016 | Image RetrievalQuestion Answering | —Unverified | 0 |
| Leveraging Visual Question Answering to Improve Text-to-Image Synthesis | Oct 28, 2020 | Auxiliary LearningImage Generation | —Unverified | 0 |
| Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models | May 30, 2025 | Image CaptioningQuestion Answering | —Unverified | 0 |
| Lightweight In-Context Tuning for Multimodal Unified Models | Oct 8, 2023 | Image CaptioningIn-Context Learning | —Unverified | 0 |
| Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning | Jun 8, 2025 | Medical Report GenerationQuestion Answering | —Unverified | 0 |
| LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation | Jul 9, 2025 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Linguistically Driven Graph Capsule Network for Visual Question Reasoning | Mar 23, 2020 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Linguistically Routing Capsule Network for Out-of-Distribution Visual Question Answering | Jan 1, 2021 | Novel ConceptsQuestion Answering | —Unverified | 0 |
| LiT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing | Jun 1, 2023 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| 利用图像描述与知识图谱增强表示的视觉问答(Exploiting Image Captions and External Knowledge as Representation Enhancement for Visual Question Answering) | Aug 1, 2021 | Image CaptioningQuestion Answering | —Unverified | 0 |
| LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning | Jun 17, 2024 | Image CaptioningQuestion Answering | —Unverified | 0 |
| LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | Jan 9, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound | Oct 19, 2024 | Instruction FollowingKnowledge Distillation | —Unverified | 0 |
| LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs | Apr 29, 2025 | BenchmarkingFace Generation | —Unverified | 0 |
| Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling | Aug 20, 2021 | Data AblationOptical Character Recognition | —Unverified | 0 |
| Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA | Apr 4, 2023 | Answer GenerationLanguage Modelling | —Unverified | 0 |
| Logically Consistent Loss for Visual Question Answering | Nov 19, 2020 | Multi-Task LearningQuestion Answering | —Unverified | 0 |
| LOIS: Looking Out of Instance Semantics for Visual Question Answering | Jul 26, 2023 | Question AnsweringVisual Question Answering | —Unverified | 0 |