| Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps | Jun 14, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering | Jun 14, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| Yo'LLaVA: Your Personalized Language and Vision Assistant | Jun 13, 2024 | Image Captioning, Question Answering | Code Available | 2 |
| Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns | Jun 13, 2024 | Autonomous Driving, Question Answering | Unverified | 0 |
| Towards Vision-Language Geo-Foundation Model: A Survey | Jun 13, 2024 | Earth Observation, Image Captioning | Code Available | 2 |
| Towards Multilingual Audio-Visual Question Answering | Jun 13, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 0 |
| Explore the Limits of Omni-modal Pretraining at Scale | Jun 13, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| Advancing High Resolution Vision-Language Models in Biomedicine | Jun 12, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| What If We Recaption Billions of Web Images with LLaMA-3? | Jun 12, 2024 | Cross-Modal Retrieval, Image Generation | Unverified | 0 |
| VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | Jun 12, 2024 | Image Generation, Language Modeling | Code Available | 5 |
| DistilDoc: Knowledge Distillation for Visually-Rich Document Applications | Jun 12, 2024 | document-image-classification, Document Image Classification | Unverified | 0 |
| Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning | Jun 11, 2024 | Benchmarking, Contrastive Learning | Code Available | 0 |
| RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent | Jun 11, 2024 | AI Agent, Descriptive | Code Available | 2 |
| VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text | Jun 10, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024 | Jun 10, 2024 | Language Modelling, object-detection | Unverified | 0 |
| CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark | Jun 10, 2024 | Diversity, Question Answering | Unverified | 0 |
| Towards Semantic Equivalence of Tokenization in Multimodal LLM | Jun 7, 2024 | Visual Question Answering | Unverified | 0 |
| DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | Jun 6, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation | Jun 6, 2024 | Common Sense Reasoning, Mamba | Unverified | 0 |
| Understanding Information Storage and Transfer in Multi-modal Large Language Models | Jun 6, 2024 | Factual Visual Question Answering, Model Editing | Unverified | 0 |
| Balancing Performance and Efficiency in Zero-shot Robotic Navigation | Jun 5, 2024 | Computational Efficiency, Question Answering | Unverified | 0 |
| Wings: Learning Multimodal LLMs without Text-only Forgetting | Jun 5, 2024 | Question Answering, Visual Question Answering | Code Available | 5 |
| From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks | Jun 4, 2024 | Image Captioning, Language Modelling | Code Available | 2 |
| Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following | Jun 4, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges | Jun 4, 2024 | Question Answering, Story Generation | Unverified | 0 |
| Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering | Jun 4, 2024 | Data Augmentation, Machine Translation | Unverified | 0 |
| Re-ReST: Reflection-Reinforced Self-Training for Language Agents | Jun 3, 2024 | Code Generation, Image Generation | Code Available | 1 |
| Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Jun 3, 2024 | Image Captioning, Language Modelling | Code Available | 2 |
| Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering | Jun 3, 2024 | Diversity, Question Answering | Unverified | 0 |
| Selectively Answering Visual Questions | Jun 3, 2024 | Avg, In-Context Learning | Unverified | 0 |
| Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera | May 30, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| VQA Training Sets are Self-play Environments for Generating Few-shot Pools | May 30, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Enhancing Large Vision Language Models with Self-Training on Image Comprehension | May 30, 2024 | Image Comprehension, Visual Question Answering | Code Available | 2 |
| Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA | May 30, 2024 | Diagnostic, Medical Diagnosis | Code Available | 1 |
| Instruction-Guided Visual Masking | May 30, 2024 | Instruction Following, Visual Grounding | Code Available | 1 |
| Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals | May 30, 2024 | counterfactual, Question Answering | Unverified | 0 |
| Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs | May 29, 2024 | Image Retrieval, Question Answering | Code Available | 1 |
| Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks | May 29, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification | May 29, 2024 | Hallucination, Image Captioning | Unverified | 0 |
| Data-augmented phrase-level alignment for mitigating object hallucination | May 28, 2024 | Data Augmentation, Hallucination | Unverified | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | Unverified | 0 |
| RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness | May 27, 2024 | Hallucination, Image Captioning | Code Available | 11 |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | May 24, 2024 | Hallucination, Image Comprehension | Code Available | 2 |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | May 24, 2024 | Common Sense Reasoning, Language Modelling | Code Available | 2 |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | May 24, 2024 | Visual Question Answering | Code Available | 2 |
| Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models | May 24, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning | May 23, 2024 | Logical Reasoning Question Answering, Spatial Reasoning | Code Available | 0 |
| LOVA3: Learning to Visual Question Answering, Asking and Assessment | May 23, 2024 | Question Answering, Visual Question Answering | Code Available | 2 |
| A Survey on Vision-Language-Action Models for Embodied AI | May 23, 2024 | Image Captioning, Instruction Following | Code Available | 4 |
| SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge | May 23, 2024 | Question Answering, RAG | Unverified | 0 |