| SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge | May 23, 2024 | Question Answering, RAG | Unverified | 0 |
| Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models | May 23, 2024 | Mixture-of-Experts, Visual Question Answering | Code Available | 2 |
| Calibrated Self-Rewarding Vision Language Models | May 23, 2024 | Hallucination, Language Modelling | Code Available | 2 |
| PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery | May 22, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | May 22, 2024 | Multimodal Reasoning, Visual Question Answering | Unverified | 0 |
| Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering | May 21, 2024 | Diversity, Information Retrieval | Code Available | 0 |
| MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | May 20, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | May 20, 2024 | Quantization, Visual Question Answering | Code Available | 2 |
| Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning | May 19, 2024 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | May 18, 2024 | Mixture-of-Experts, Visual Question Answering | Code Available | 5 |
| EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging | May 18, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| StackOverflowVQA: Stack Overflow Visual Question Answering Dataset | May 17, 2024 | Question Answering, Sentence | Unverified | 0 |
| Efficient Multimodal Large Language Models: A Survey | May 17, 2024 | Edge-computing, Question Answering | Code Available | 3 |
| UniRAG: Universal Retrieval Augmentation for Large Vision Language Models | May 16, 2024 | Image Captioning, Image Generation | Code Available | 1 |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | May 16, 2024 | Image Captioning, Image Generation | Code Available | 7 |
| Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model | May 15, 2024 | GPU, Language Modeling | Code Available | 2 |
| CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering | May 13, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI | May 12, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Federated Document Visual Question Answering: A Pilot Study | May 10, 2024 | Federated Learning, Question Answering | Code Available | 0 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | May 9, 2024 | Image Captioning, Instruction Following | Code Available | 2 |
| Is the House Ready For Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering | May 8, 2024 | 2k, Embodied Question Answering | Unverified | 0 |
| VSA4VQA: Scaling a Vector Symbolic Architecture to Visual Question Answering on Natural Images | May 6, 2024 | Attribute, Language Modeling | Unverified | 0 |
| Advancing Multimodal Medical Capabilities of Gemini | May 6, 2024 | Computed Tomography (CT), image-classification | Unverified | 0 |
| Language-Image Models with 3D Understanding | May 6, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning | May 2, 2024 | Autonomous Driving, counterfactual | Code Available | 4 |
| Understanding Figurative Meaning through Explainable Visual Entailment | May 2, 2024 | Question Answering, Visual Entailment | Code Available | 1 |
| Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis | May 1, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| CREPE: Coordinate-Aware End-to-End Document Parser | May 1, 2024 | document understanding, Optical Character Recognition (OCR) | Unverified | 0 |
| Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach | May 1, 2024 | Computational Efficiency, Question Answering | Unverified | 0 |
| TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains | Apr 30, 2024 | Language Modelling, Large Language Model | Code Available | 1 |
| Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism | Apr 29, 2024 | document understanding, GPU | Code Available | 0 |
| ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images | Apr 29, 2024 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available | 1 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Apr 25, 2024 | Visual Grounding, Visual Question Answering | Code Available | 2 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Apr 25, 2024 | 4k, Language Modeling | Unverified | 0 |
| Efficiency in Focus: LayerNorm as a Catalyst for Fine-tuning Medical Visual Language Pre-trained Models | Apr 25, 2024 | Medical Visual Question Answering, parameter-efficient fine-tuning | Unverified | 0 |
| Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering | Apr 24, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | Apr 23, 2024 | Question Answering, Retrieval | Unverified | 0 |
| GSCo: Towards Generalizable AI in Medicine via Generalist-Specialist Collaboration | Apr 23, 2024 | Collaborative Inference, In-Context Learning | Code Available | 2 |
| Grounded Knowledge-Enhanced Medical VLP for Chest X-Ray | Apr 23, 2024 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| WangLab at MEDIQA-M3G 2024: Multimodal Medical Answer Generation using Large Language Models | Apr 22, 2024 | Answer Generation, image-classification | Unverified | 0 |
| Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering | Apr 22, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers | Apr 21, 2024 | Diagnostic, Image Captioning | Code Available | 0 |
| Exploring Diverse Methods in Visual Question Answering | Apr 21, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| LaPA: Latent Prompt Assist Model For Medical Visual Question Answering | Apr 19, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 1 |
| PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering | Apr 19, 2024 | Articles, Information Retrieval | Unverified | 0 |
| Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning | Apr 19, 2024 | Benchmarking, counterfactual | Unverified | 0 |
| TextSquare: Scaling up Text-Centric Visual Instruction Tuning | Apr 19, 2024 | Hallucination, Hallucination Evaluation | Unverified | 0 |
| Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering | Apr 18, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale | Apr 18, 2024 | Decision Making, Medical Visual Question Answering | Unverified | 0 |
| Self-Supervised Visual Preference Alignment | Apr 16, 2024 | 8k, MM-Vet | Code Available | 2 |