| SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge | May 23, 2024 | Question AnsweringRAG | —Unverified | 0 |
| Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models | May 23, 2024 | Mixture-of-ExpertsVisual Question Answering | CodeCode Available | 2 |
| Calibrated Self-Rewarding Vision Language Models | May 23, 2024 | HallucinationLanguage Modelling | CodeCode Available | 2 |
| PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery | May 22, 2024 | Question AnsweringVisual Question Answering | CodeCode Available | 1 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | May 22, 2024 | Multimodal ReasoningVisual Question Answering | —Unverified | 0 |
| Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering | May 21, 2024 | DiversityInformation Retrieval | CodeCode Available | 0 |
| MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | May 20, 2024 | BenchmarkingQuestion Answering | CodeCode Available | 2 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | May 20, 2024 | QuantizationVisual Question Answering | CodeCode Available | 2 |
| Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning | May 19, 2024 | Multimodal ReasoningQuestion Answering | —Unverified | 0 |
| EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging | May 18, 2024 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | May 18, 2024 | Mixture-of-ExpertsVisual Question Answering | CodeCode Available | 5 |
| StackOverflowVQA: Stack Overflow Visual Question Answering Dataset | May 17, 2024 | Question AnsweringSentence | —Unverified | 0 |
| Efficient Multimodal Large Language Models: A Survey | May 17, 2024 | Edge-computingQuestion Answering | CodeCode Available | 3 |
| UniRAG: Universal Retrieval Augmentation for Large Vision Language Models | May 16, 2024 | Image CaptioningImage Generation | CodeCode Available | 1 |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | May 16, 2024 | Image CaptioningImage Generation | CodeCode Available | 7 |
| Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model | May 15, 2024 | GPULanguage Modeling | CodeCode Available | 2 |
| CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering | May 13, 2024 | Audio-visual Question AnsweringAudio-Visual Question Answering (AVQA) | —Unverified | 0 |
| Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI | May 12, 2024 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Federated Document Visual Question Answering: A Pilot Study | May 10, 2024 | Federated LearningQuestion Answering | CodeCode Available | 0 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | May 9, 2024 | Image CaptioningInstruction Following | CodeCode Available | 2 |
| Is the House Ready For Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering | May 8, 2024 | 2kEmbodied Question Answering | —Unverified | 0 |
| VSA4VQA: Scaling a Vector Symbolic Architecture to Visual Question Answering on Natural Images | May 6, 2024 | AttributeLanguage Modeling | —Unverified | 0 |
| Language-Image Models with 3D Understanding | May 6, 2024 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Advancing Multimodal Medical Capabilities of Gemini | May 6, 2024 | Computed Tomography (CT)image-classification | —Unverified | 0 |
| OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning | May 2, 2024 | Autonomous Drivingcounterfactual | CodeCode Available | 4 |