| As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks? | Mar 19, 2024 | Adversarial Attack; Image Captioning | Unverified | 0 |
| Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models | Mar 19, 2024 | Instruction Following; visual instruction following | Code Available | 2 |
| SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors | Mar 18, 2024 | Hallucination; Motion Planning | Unverified | 0 |
| FlexCap: Describe Anything in Images in Controllable Detail | Mar 18, 2024 | Attribute; Dense Captioning | Unverified | 0 |
| Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis | Mar 18, 2024 | In-Context Learning; Question Answering | Unverified | 0 |
| SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant | Mar 17, 2024 | Language Modelling; Question Answering | Code Available | 1 |
| Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches | Mar 17, 2024 | Image Captioning; Question Answering | Unverified | 0 |
| Knowledge Condensation and Reasoning for Knowledge-based VQA | Mar 15, 2024 | Question Answering; Reading Comprehension | Unverified | 0 |
| Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | Mar 15, 2024 | Few-Shot Image Classification; image-classification | Unverified | 0 |
| Parameter Efficient Reinforcement Learning from Human Feedback | Mar 15, 2024 | Question Answering; reinforcement-learning | Unverified | 0 |
| Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering | Mar 14, 2024 | Optical Character Recognition; Optical Character Recognition (OCR) | Code Available | 0 |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | Mar 14, 2024 | In-Context Learning; Mixture-of-Experts | Unverified | 0 |
| VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework | Mar 14, 2024 | Language Modeling; Language Modelling | Unverified | 0 |
| Can We Talk Models Into Seeing the World Differently? | Mar 14, 2024 | Image Captioning; Image Classification | Code Available | 1 |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | Mar 13, 2024 | Language Modeling; Language Modelling | Unverified | 0 |
| Fine-tuning Large Language Models with Sequential Instructions | Mar 12, 2024 | Question Answering; Visual Question Answering | Unverified | 0 |
| Mitigating the Impact of Attribute Editing on Face Recognition | Mar 12, 2024 | Attribute; Face Recognition | Unverified | 0 |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | Mar 12, 2024 | All; Mixture-of-Experts | Code Available | 3 |
| Beyond Text: Frozen Large Language Models in Visual Signal Comprehension | Mar 12, 2024 | Deblurring; Decoder | Code Available | 2 |
| Multi-modal Auto-regressive Modeling via Visual Words | Mar 12, 2024 | Visual Question Answering; Visual Question Answering (VQA) | Code Available | 1 |
| Answering Diverse Questions via Text Attached with Key Audio-Visual Clues | Mar 11, 2024 | Audio-visual Question Answering; Audio-Visual Question Answering (AVQA) | Code Available | 0 |
| Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models | Mar 10, 2024 | Visual Question Answering | Code Available | 3 |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | Mar 8, 2024 | Chatbot; Language Modelling | Code Available | 7 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Mar 8, 2024 | 1 Image, 2*2 Stitching; Code Generation | Code Available | 3 |
| SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM | Mar 7, 2024 | Question Answering; Retrieval | Unverified | 0 |
| CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Mar 7, 2024 | Audio-visual Question Answering; Audio-Visual Question Answering (AVQA) | Code Available | 2 |
| Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Mar 6, 2024 | Multimodal Reasoning; Question Answering | Code Available | 2 |
| Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use | Mar 5, 2024 | image-classification; Image Classification | Unverified | 0 |
| CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable Environments | Mar 5, 2024 | Language Modelling; Large Language Model | Unverified | 0 |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | Mar 5, 2024 | TextVQA; Visual Question Answering | Code Available | 3 |
| MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting | Mar 5, 2024 | In-Context Learning; Object Rearrangement | Unverified | 0 |
| Enhancing Generalization in Medical Visual Question Answering Tasks via Gradient-Guided Model Perturbation | Mar 5, 2024 | Data Augmentation; Medical Visual Question Answering | Unverified | 0 |
| Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review | Mar 4, 2024 | Medical Report Generation; Question Answering | Code Available | 3 |
| InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding | Mar 3, 2024 | Visual Question Answering | Unverified | 0 |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | Feb 29, 2024 | All; Hallucination | Code Available | 4 |
| A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models | Feb 28, 2024 | Image Description; Question Answering | Unverified | 0 |
| ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks | Feb 27, 2024 | Domain Generalization; Image Captioning | Unverified | 0 |
| VCD: Knowledge Base Guided Visual Commonsense Discovery in Images | Feb 27, 2024 | Decision Making; Language Modelling | Unverified | 0 |
| Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning | Feb 26, 2024 | Data Augmentation; document understanding | Unverified | 0 |
| LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery | Feb 26, 2024 | Continual Learning; Exemplar-Free | Code Available | 0 |
| RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis | Feb 25, 2024 | Code Generation; Multimodal Reasoning | Unverified | 0 |
| Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA | Feb 24, 2024 | 3D Question Answering (3D-QA); Question Answering | Code Available | 1 |
| VISREAS: Complex Visual Reasoning with Unanswerable Questions | Feb 23, 2024 | Question Answering; Visual Question Answering | Unverified | 0 |
| Multimodal Transformer With a Low-Computational-Cost Guarantee | Feb 23, 2024 | Action Recognition; Question Answering | Unverified | 0 |
| CommVQA: Situating Visual Question Answering in Communicative Contexts | Feb 22, 2024 | Question Answering; Visual Question Answering | Code Available | 0 |
| Uncertainty-Aware Evaluation for Vision-Language Models | Feb 22, 2024 | Conformal Prediction; Language Modeling | Code Available | 1 |
| Visual Hallucinations of Multi-modal Large Language Models | Feb 22, 2024 | Diversity; Hallucination | Code Available | 1 |
| TinyLLaVA: A Framework of Small-scale Large Multimodal Models | Feb 22, 2024 | Visual Question Answering | Code Available | 4 |
| Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment | Feb 21, 2024 | Language Modelling; Question Answering | Code Available | 1 |
| Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions | Feb 20, 2024 | Image Captioning; Question Answering | Unverified | 0 |