| Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches | Mar 17, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| Parameter Efficient Reinforcement Learning from Human Feedback | Mar 15, 2024 | Question Answering, reinforcement-learning | Unverified | 0 |
| Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | Mar 15, 2024 | Few-Shot Image Classification, image-classification | Unverified | 0 |
| Knowledge Condensation and Reasoning for Knowledge-based VQA | Mar 15, 2024 | Question Answering, Reading Comprehension | Unverified | 0 |
| VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework | Mar 14, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering | Mar 14, 2024 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available | 0 |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | Mar 14, 2024 | In-Context Learning, Mixture-of-Experts | Unverified | 0 |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | Mar 13, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Mitigating the Impact of Attribute Editing on Face Recognition | Mar 12, 2024 | Attribute, Face Recognition | Unverified | 0 |
| Fine-tuning Large Language Models with Sequential Instructions | Mar 12, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Answering Diverse Questions via Text Attached with Key Audio-Visual Clues | Mar 11, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 0 |
| SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM | Mar 7, 2024 | Question Answering, Retrieval | Unverified | 0 |
| Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use | Mar 5, 2024 | image-classification, Image Classification | Unverified | 0 |
| CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable Environments | Mar 5, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting | Mar 5, 2024 | In-Context Learning, Object Rearrangement | Unverified | 0 |
| Enhancing Generalization in Medical Visual Question Answering Tasks via Gradient-Guided Model Perturbation | Mar 5, 2024 | Data Augmentation, Medical Visual Question Answering | Unverified | 0 |
| InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding | Mar 3, 2024 | Visual Question Answering | Unverified | 0 |
| A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models | Feb 28, 2024 | Image Description, Question Answering | Unverified | 0 |
| VCD: Knowledge Base Guided Visual Commonsense Discovery in Images | Feb 27, 2024 | Decision Making, Language Modelling | Unverified | 0 |
| ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks | Feb 27, 2024 | Domain Generalization, Image Captioning | Unverified | 0 |
| Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning | Feb 26, 2024 | Data Augmentation, document understanding | Unverified | 0 |
| LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery | Feb 26, 2024 | Continual Learning, Exemplar-Free | Code Available | 0 |
| RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis | Feb 25, 2024 | Code Generation, Multimodal Reasoning | Unverified | 0 |
| VISREAS: Complex Visual Reasoning with Unanswerable Questions | Feb 23, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Multimodal Transformer With a Low-Computational-Cost Guarantee | Feb 23, 2024 | Action Recognition, Question Answering | Unverified | 0 |
| CommVQA: Situating Visual Question Answering in Communicative Contexts | Feb 22, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering | Feb 20, 2024 | Knowledge Graphs, Question Answering | Unverified | 0 |
| Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions | Feb 20, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models | Feb 19, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning | Feb 18, 2024 | Hallucination, Visual Question Answering | Unverified | 0 |
| II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering | Feb 16, 2024 | Question Answering, Triplet | Code Available | 0 |
| PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter | Feb 16, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models | Feb 16, 2024 | Adversarial Robustness, Language Modelling | Unverified | 0 |
| Prompt-based Personalized Federated Learning for Medical Visual Question Answering | Feb 15, 2024 | Federated Learning, Medical Visual Question Answering | Unverified | 0 |
| Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays | Feb 14, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Visually Dehallucinative Instruction Generation | Feb 13, 2024 | Hallucination, Language Modeling | Code Available | 0 |
| Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks | Feb 13, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Learning How To Ask: Cycle-Consistency Refines Prompts in Multimodal Foundation Models | Feb 13, 2024 | Code Generation, HumanEval | Unverified | 0 |
| Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data | Feb 12, 2024 | Decoder, Marketing | Code Available | 0 |
| PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs | Feb 12, 2024 | Instruction Following, Logical Reasoning | Unverified | 0 |
| CIC: A Framework for Culturally-Aware Image Captioning | Feb 8, 2024 | Descriptive, Image Captioning | Unverified | 0 |
| Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images | Feb 8, 2024 | Image Captioning, Question Answering | Code Available | 0 |
| Convincing Rationales for Visual Question Answering Reasoning | Feb 6, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Knowledge Generation for Zero-shot Knowledge-based VQA | Feb 4, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Instruction Makes a Difference | Feb 1, 2024 | Hallucination, Instruction Following | Code Available | 0 |
| Can Generative AI Support Patients' & Caregivers' Informational Needs? Towards Task-Centric Evaluation Of AI Systems | Jan 31, 2024 | Computed Tomography (CT), Diagnostic | Unverified | 0 |
| From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information | Jan 31, 2024 | Hallucination, object-detection | Unverified | 0 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | Jan 29, 2024 | Form, Language Modeling | Unverified | 0 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | Jan 29, 2024 | Benchmarking, Image Comprehension | Unverified | 0 |
| LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering | Jan 29, 2024 | Language Modeling, Language Modelling | Unverified | 0 |