| Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | Aug 22, 2024 | 10-shot image generation | Code Available | 5 |
| Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework | Aug 21, 2024 | geo-localization, Language Modeling | Unverified | 0 |
| CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in Visual Question Answering | Aug 21, 2024 | Continual Learning, Question Answering | Code Available | 0 |
| SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs | Aug 21, 2024 | Contrastive Learning, Language Modeling | Unverified | 0 |
| V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard? | Aug 20, 2024 | Few-Shot Learning, In-Context Learning | Code Available | 1 |
| TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition | Aug 19, 2024 | GPU, Multi-Task Learning | Code Available | 0 |
| PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding | Aug 18, 2024 | Language Modelling, Question Answering | Code Available | 2 |
| FEDMEKI: A Benchmark for Scaling Medical Foundation Models via Federated Knowledge Injection | Aug 17, 2024 | Federated Learning, Medical Visual Question Answering | Code Available | 0 |
| Beyond the Hype: A dispassionate look at vision-language models in medical scenario | Aug 16, 2024 | Question Answering, Spatial Reasoning | Unverified | 0 |
| Med-PMC: Medical Personalized Multi-modal Consultation with a Proactive Ask-First-Observe-Next Paradigm | Aug 16, 2024 | Decision Making, Medical Visual Question Answering | Code Available | 0 |
| A Survey on Benchmarks of Multimodal Large Language Models | Aug 16, 2024 | Question Answering, Survey | Code Available | 2 |
| Visual Agents as Fast and Slow Thinkers | Aug 16, 2024 | Question Answering, Reasoning Segmentation | Code Available | 1 |
| IIU: Independent Inference Units for Knowledge-based Visual Question Answering | Aug 15, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion | Aug 14, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| CROME: Cross-Modal Adapters for Efficient Multimodal LLM | Aug 13, 2024 | Instruction Following, Language Modeling | Unverified | 0 |
| SWIFT: A Scalable lightWeight Infrastructure for Fine-Tuning | Aug 10, 2024 | Hallucination, Optical Character Recognition | Code Available | 11 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | Aug 9, 2024 | Language Modeling, Language Modelling | Code Available | 7 |
| Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery | Aug 9, 2024 | Contrastive Learning, Medical Visual Question Answering | Code Available | 1 |
| Revisiting Multi-Modal LLM Evaluation | Aug 9, 2024 | Chart Understanding, Optical Character Recognition | Unverified | 0 |
| Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models | Aug 8, 2024 | Contrastive Learning, Fine-Grained Image Recognition | Unverified | 0 |
| Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation | Aug 7, 2024 | GPU, Question Answering | Unverified | 0 |
| GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI | Aug 6, 2024 | Question Answering, Visual Question Answering | Code Available | 2 |
| Targeted Visual Prompting for Medical Visual Question Answering | Aug 6, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 0 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 |
| Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining | Aug 5, 2024 | Decoder, Depth Estimation | Code Available | 7 |
| MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph | Aug 3, 2024 | Attribute, Contrastive Learning | Unverified | 0 |
| Towards Flexible Evaluation for Generative Visual Question Answering | Aug 1, 2024 | Decoder, Generative Visual Question Answering | Code Available | 0 |
| MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | Aug 1, 2024 | Math, MM-Vet | Code Available | 3 |
| SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving | Jul 31, 2024 | Autonomous Driving, Language Modeling | Unverified | 0 |
| Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering | Jul 31, 2024 | Diagnostic, Hallucination | Unverified | 0 |
| Boosting Audio Visual Question Answering via Key Semantic-Aware Cues | Jul 30, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering | Jul 30, 2024 | Code Generation, Question Answering | Unverified | 0 |
| Take A Step Back: Rethinking the Two Stages in Visual Reasoning | Jul 29, 2024 | Logical Reasoning, Question Answering | Unverified | 0 |
| VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks | Jul 29, 2024 | Deep Learning, Domain Generalization | Unverified | 0 |
| AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering | Jul 28, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Towards A Generalizable Pathology Foundation Model via Unified Knowledge Distillation | Jul 26, 2024 | Knowledge Distillation, Question Answering | Code Available | 2 |
| VILA^2: VILA Augmented VILA | Jul 24, 2024 | Hallucination, Optical Character Recognition (OCR) | Unverified | 0 |
| INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | Jul 23, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models | Jul 23, 2024 | Computational Efficiency, Image Captioning | Unverified | 0 |
| Learning Trimodal Relation for AVQA with Missing Modality | Jul 23, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models | Jul 22, 2024 | Question Answering, Representation Learning | Unverified | 0 |
| Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models | Jul 22, 2024 | Disentanglement, Question Answering | Code Available | 0 |
| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Jul 22, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | Jul 22, 2024 | Diversity, Multiple-choice | Code Available | 2 |
| QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View | Jul 18, 2024 | Action Anticipation, Action Recognition | Code Available | 0 |
| Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark | Jul 18, 2024 | GPU, Image Retrieval | Code Available | 1 |
| Multimodal Reranking for Knowledge-Intensive Visual Question Answering | Jul 17, 2024 | Answer Generation, Question Answering | Unverified | 0 |
| ProcTag: Process Tagging for Assessing the Efficacy of Document Instruction Data | Jul 17, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| EchoSight: Advancing Visual-Language Models with Wiki Knowledge | Jul 17, 2024 | Articles, Question Answering | Unverified | 0 |
| TM-PATHVQA: 90000+ Textless Multilingual Questions for Medical Visual Question Answering | Jul 16, 2024 | Medical Visual Question Answering, Question Answering | Unverified | 0 |