| Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images | Feb 8, 2024 | Image Captioning, Question Answering | Code Available | 0 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Feb 8, 2024 | Benchmarking, Diversity | Code Available | 7 |
| Convincing Rationales for Visual Question Answering Reasoning | Feb 6, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Text-Guided Image Clustering | Feb 5, 2024 | Clustering, Image Captioning | Code Available | 1 |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | Feb 5, 2024 | Science Question Answering, Text-to-Video Generation | Code Available | 4 |
| GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Feb 4, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| Knowledge Generation for Zero-shot Knowledge-based VQA | Feb 4, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Instruction Makes a Difference | Feb 1, 2024 | Hallucination, Instruction Following | Code Available | 0 |
| Can Generative AI Support Patients' & Caregivers' Informational Needs? Towards Task-Centric Evaluation Of AI Systems | Jan 31, 2024 | Computed Tomography (CT), Diagnostic | Unverified | 0 |
| From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information | Jan 31, 2024 | Hallucination, object-detection | Unverified | 0 |
| MouSi: Poly-Visual-Expert Vision-Language Models | Jan 30, 2024 | Image Segmentation, Image-text matching | Code Available | 2 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | Jan 29, 2024 | Form, Language Modeling | Unverified | 0 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | Jan 29, 2024 | Benchmarking, Image Comprehension | Unverified | 0 |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | Jan 29, 2024 | Hallucination, Mixture-of-Experts | Code Available | 7 |
| LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering | Jan 29, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning | Jan 28, 2024 | Data Augmentation, Question Answering | Unverified | 0 |
| Free Form Medical Visual Question Answering in Radiology | Jan 23, 2024 | Diagnostic, Form | Unverified | 0 |
| Small Language Model Meets with Reinforced Vision Vocabulary | Jan 23, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | Jan 22, 2024 | Question Answering, Spatial Reasoning | Unverified | 0 |
| Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge | Jan 19, 2024 | Question Answering, Question Generation | Code Available | 1 |
| Veagle: Advancements in Multimodal Representation Learning | Jan 18, 2024 | Image Captioning, Language Modelling | Code Available | 1 |
| Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation | Jan 18, 2024 | Contrastive Learning, Prompt Engineering | Code Available | 1 |
| COCO is "ALL" You Need for Visual Instruction Fine-tuning | Jan 17, 2024 | All, Image Captioning | Unverified | 0 |
| Uncovering the Full Potential of Visual Grounding Methods in VQA | Jan 15, 2024 | Question Answering, Visual Grounding | Code Available | 0 |
| BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining | Jan 12, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |