| Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization | Nov 15, 2024 | Hallucination, Hallucination Evaluation | —Unverified | 0 | 0 |
| MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning | Sep 9, 2024 | Federated Learning, Image Captioning | —Unverified | 0 | 0 |
| MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation | Mar 23, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval | May 26, 2025 | Image Retrieval, Large Language Model | —Unverified | 0 | 0 |
| MLLMReID: Multimodal Large Language Model-based Person Re-identification | Jan 24, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| MMMModal -- Multi-Images Multi-Audio Multi-turn Multi-Modal | Feb 17, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation | Feb 17, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| MobileFlow: A Multimodal LLM For Mobile GUI Agent | Jul 5, 2024 | Action Analysis, Language Modelling | —Unverified | 0 | 0 |
| MoChat: Joints-Grouped Spatio-Temporal Grounding LLM for Multi-Turn Motion Comprehension and Description | Oct 15, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills | May 9, 2025 | Image Retouching, Large Language Model | —Unverified | 0 | 0 |
| MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models | Dec 2, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving | Feb 4, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Jul 4, 2023 | document understanding, Language Modeling | —Unverified | 0 | 0 |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | Nov 30, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| MRIR: Integrating Multimodal Insights for Diffusion-based Realistic Image Restoration | Jul 4, 2024 | Denoising, Image Restoration | —Unverified | 0 | 0 |
| MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception | Jun 22, 2024 | Common Sense Reasoning, Language Modelling | —Unverified | 0 | 0 |
| Multimodal Large Language Model Driven Scenario Testing for Autonomous Vehicles | Sep 10, 2024 | Autonomous Vehicles, Language Modeling | —Unverified | 0 | 0 |
| Multimodal Large Language Model for Visual Navigation | Oct 12, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation | Apr 23, 2024 | Image Generation, Language Modeling | —Unverified | 0 | 0 |
| Multimodal Multi-turn Conversation Stance Detection: A Challenge Dataset and Effective Model | Sep 1, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Multimodal Transformer for Comics Text-Cloze | Mar 6, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| ObjectFinder: An Open-Vocabulary Assistive System for Interactive Object Search by Blind People | Dec 4, 2024 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 | 0 |
| OCC-MLLM: Empowering Multimodal Large Language Model For the Understanding of Occluded Objects | Oct 2, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning | Mar 14, 2025 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 | 0 |
| OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models | Feb 22, 2025 | document understanding, Key Information Extraction | —Unverified | 0 | 0 |
| OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions | May 27, 2025 | Audio-Visual Synchronization, Conversational Response Generation | —Unverified | 0 | 0 |
| Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | Jan 14, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| On Fairness of Unified Multimodal Large Language Model for Image Generation | Feb 5, 2025 | Fairness, Image Generation | —Unverified | 0 | 0 |
| On Path to Multimodal Generalist: General-Level and General-Bench | May 7, 2025 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 | 0 |
| OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model | May 25, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources | Apr 1, 2025 | GPU, Large Language Model | —Unverified | 0 | 0 |
| Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy | Feb 27, 2025 | Large Language Model, Minecraft | —Unverified | 0 | 0 |
| Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training | Mar 31, 2025 | GPU, Language Modeling | —Unverified | 0 | 0 |
| ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling | May 19, 2025 | Graph Generation, Knowledge Distillation | —Unverified | 0 | 0 |
| OrthoDoc: Multimodal Large Language Model for Assisting Diagnosis in Computed Tomography | Aug 30, 2024 | Computed Tomography (CT), Diagnostic | —Unverified | 0 | 0 |
| PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis | Aug 18, 2024 | Aspect-Based Sentiment Analysis, Aspect-Based Sentiment Analysis (ABSA) | —Unverified | 0 | 0 |
| Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin | Jun 5, 2025 | Large Language Model, Morphological Analysis | —Unverified | 0 | 0 |
| PHRASED: Phrase Dictionary Biasing for Speech Translation | Jun 10, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks | Mar 6, 2025 | document understanding, Language Modeling | —Unverified | 0 | 0 |
| Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model | Apr 9, 2025 | Image Quality Assessment, Image Restoration | —Unverified | 0 | 0 |
| RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models | Apr 18, 2024 | Fact Checking, Language Modeling | —Unverified | 0 | 0 |
| Realistic Corner Case Generation for Autonomous Vehicles with Multimodal Large Language Model | Nov 29, 2024 | Autonomous Vehicles, Language Modeling | —Unverified | 0 | 0 |
| RespLLM: Unifying Audio and Text with Multimodal LLMs for Generalized Respiratory Health Prediction | Oct 7, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation | May 30, 2025 | Autonomous Driving, Autonomous Vehicles | —Unverified | 0 | 0 |
| S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation | Jan 1, 2025 | Autonomous Driving, Autonomous Vehicles | —Unverified | 0 | 0 |
| Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation | May 27, 2024 | Instruction Following, Language Modeling | —Unverified | 0 | 0 |
| SemGrasp: Semantic Grasp Generation via Language Aligned Discretization | Apr 4, 2024 | Grasp Generation, Language Modeling | —Unverified | 0 | 0 |