| Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches | Mar 17, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| Parameter Efficient Reinforcement Learning from Human Feedback | Mar 15, 2024 | Question Answering, reinforcement-learning | Unverified | 0 |
| Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | Mar 15, 2024 | Few-Shot Image Classification, image-classification | Unverified | 0 |
| Knowledge Condensation and Reasoning for Knowledge-based VQA | Mar 15, 2024 | Question Answering, Reading Comprehension | Unverified | 0 |
| VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework | Mar 14, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering | Mar 14, 2024 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available | 0 |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | Mar 14, 2024 | In-Context Learning, Mixture-of-Experts | Unverified | 0 |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | Mar 13, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Mitigating the Impact of Attribute Editing on Face Recognition | Mar 12, 2024 | Attribute, Face Recognition | Unverified | 0 |
| Fine-tuning Large Language Models with Sequential Instructions | Mar 12, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Answering Diverse Questions via Text Attached with Key Audio-Visual Clues | Mar 11, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 0 |
| SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM | Mar 7, 2024 | Question Answering, Retrieval | Unverified | 0 |
| Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use | Mar 5, 2024 | image-classification, Image Classification | Unverified | 0 |
| CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable Environments | Mar 5, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting | Mar 5, 2024 | In-Context Learning, Object Rearrangement | Unverified | 0 |
| Enhancing Generalization in Medical Visual Question Answering Tasks via Gradient-Guided Model Perturbation | Mar 5, 2024 | Data Augmentation, Medical Visual Question Answering | Unverified | 0 |
| InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding | Mar 3, 2024 | Visual Question Answering | Unverified | 0 |
| A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models | Feb 28, 2024 | Image Description, Question Answering | Unverified | 0 |
| VCD: Knowledge Base Guided Visual Commonsense Discovery in Images | Feb 27, 2024 | Decision Making, Language Modelling | Unverified | 0 |
| ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks | Feb 27, 2024 | Domain Generalization, Image Captioning | Unverified | 0 |
| Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning | Feb 26, 2024 | Data Augmentation, document understanding | Unverified | 0 |
| LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery | Feb 26, 2024 | Continual Learning, Exemplar-Free | Code Available | 0 |
| RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis | Feb 25, 2024 | Code Generation, Multimodal Reasoning | Unverified | 0 |
| VISREAS: Complex Visual Reasoning with Unanswerable Questions | Feb 23, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Multimodal Transformer With a Low-Computational-Cost Guarantee | Feb 23, 2024 | Action Recognition, Question Answering | Unverified | 0 |