| Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering | Jun 4, 2024 | Data Augmentation, Machine Translation | Unverified | 0 |
| Re-ReST: Reflection-Reinforced Self-Training for Language Agents | Jun 3, 2024 | Code Generation, Image Generation | Code Available | 1 |
| Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Jun 3, 2024 | Image Captioning, Language Modelling | Code Available | 2 |
| Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering | Jun 3, 2024 | Diversity, Question Answering | Unverified | 0 |
| Selectively Answering Visual Questions | Jun 3, 2024 | Avg, In-Context Learning | Unverified | 0 |
| Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera | May 30, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| VQA Training Sets are Self-play Environments for Generating Few-shot Pools | May 30, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Enhancing Large Vision Language Models with Self-Training on Image Comprehension | May 30, 2024 | Image Comprehension, Visual Question Answering | Code Available | 2 |
| Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA | May 30, 2024 | Diagnostic, Medical Diagnosis | Code Available | 1 |
| Instruction-Guided Visual Masking | May 30, 2024 | Instruction Following, Visual Grounding | Code Available | 1 |
| Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals | May 30, 2024 | counterfactual, Question Answering | Unverified | 0 |
| Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs | May 29, 2024 | Image Retrieval, Question Answering | Code Available | 1 |
| Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks | May 29, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification | May 29, 2024 | Hallucination, Image Captioning | Unverified | 0 |
| Data-augmented phrase-level alignment for mitigating object hallucination | May 28, 2024 | Data Augmentation, Hallucination | Unverified | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | Unverified | 0 |
| RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness | May 27, 2024 | Hallucination, Image Captioning | Code Available | 11 |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | May 24, 2024 | Hallucination, Image Comprehension | Code Available | 2 |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | May 24, 2024 | Common Sense Reasoning, Language Modelling | Code Available | 2 |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | May 24, 2024 | Visual Question Answering | Code Available | 2 |
| Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models | May 24, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning | May 23, 2024 | Logical Reasoning Question Answering, Spatial Reasoning | Code Available | 0 |
| LOVA3: Learning to Visual Question Answering, Asking and Assessment | May 23, 2024 | Question Answering, Visual Question Answering | Code Available | 2 |
| A Survey on Vision-Language-Action Models for Embodied AI | May 23, 2024 | Image Captioning, Instruction Following | Code Available | 4 |
| SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge | May 23, 2024 | Question Answering, RAG | Unverified | 0 |