| Language Repository for Long Video Understanding | Mar 21, 2024 | EgoSchema; Question Answering | Code Available | 1 |
| HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Mar 20, 2024 | MME; Visual Question Answering | Code Available | 1 |
| SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant | Mar 17, 2024 | Language Modelling; Question Answering | Code Available | 1 |
| Can We Talk Models Into Seeing the World Differently? | Mar 14, 2024 | Image Captioning; Image Classification | Code Available | 1 |
| Multi-modal Auto-regressive Modeling via Visual Words | Mar 12, 2024 | Visual Question Answering; Visual Question Answering (VQA) | Code Available | 1 |
| Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA | Feb 24, 2024 | 3D Question Answering (3D-QA); Question Answering | Code Available | 1 |
| Uncertainty-Aware Evaluation for Vision-Language Models | Feb 22, 2024 | Conformal Prediction; Language Modeling | Code Available | 1 |
| Visual Hallucinations of Multi-modal Large Language Models | Feb 22, 2024 | Diversity; Hallucination | Code Available | 1 |
| Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment | Feb 21, 2024 | Language Modelling; Question Answering | Code Available | 1 |
| Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models | Feb 16, 2024 | Diversity; Instruction Following | Code Available | 1 |
| Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy | Feb 11, 2024 | Language Modeling; Open Vocabulary Attribute Detection | Code Available | 1 |
| Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations | Feb 10, 2024 | Diagnostic; Hallucination | Code Available | 1 |
| Text-Guided Image Clustering | Feb 5, 2024 | Clustering; Image Captioning | Code Available | 1 |
| Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge | Jan 19, 2024 | Question Answering; Question Generation | Code Available | 1 |
| Veagle: Advancements in Multimodal Representation Learning | Jan 18, 2024 | Image Captioning; Language Modelling | Code Available | 1 |
| Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation | Jan 18, 2024 | Contrastive Learning; Prompt Engineering | Code Available | 1 |
| Cross-modal Retrieval for Knowledge-based Visual Question Answering | Jan 11, 2024 | Cross-Modal Retrieval; Question Answering | Code Available | 1 |
| MISS: A Generative Pretraining and Finetuning Approach for Med-VQA | Jan 10, 2024 | Medical Visual Question Answering; Multi-Task Learning | Code Available | 1 |
| CaMML: Context-Aware Multimodal Learner for Large Models | Jan 6, 2024 | Visual Question Answering | Code Available | 1 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Dec 21, 2023 | Image Retrieval; Image-to-Text Retrieval | Code Available | 1 |
| EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Dec 19, 2023 | Object; Object Counting | Code Available | 1 |
| Gemini: A Family of Highly Capable Multimodal Models | Dec 19, 2023 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | Code Available | 1 |
| HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles | Dec 18, 2023 | Question Answering; Visual Question Answering | Code Available | 1 |
| Privacy-Aware Document Visual Question Answering | Dec 15, 2023 | document understanding; Federated Learning | Code Available | 1 |
| WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data | Dec 15, 2023 | document understanding; Question Answering | Code Available | 1 |