| Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy | Feb 27, 2025 | Large Language Model, Minecraft | Unverified | 0 |
| AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Feb 27, 2025 | Language Modeling, Language Modelling | Code Available | 3 |
| Introducing Visual Perception Token into Multimodal Large Language Model | Feb 24, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Feb 24, 2025 | Language Modeling, Language Modelling | Code Available | 4 |
| OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models | Feb 22, 2025 | document understanding, Key Information Extraction | Code Available | 0 |
| Towards Text-Image Interleaved Retrieval | Feb 18, 2025 | Information Retrieval, Language Modeling | Code Available | 1 |
| Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders | Feb 18, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation | Feb 17, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring | Feb 16, 2025 | Instance Segmentation, Language Modeling | Unverified | 0 |
| Distraction is All You Need for Multimodal Large Language Model Jailbreaking | Feb 15, 2025 | All, Language Modeling | Unverified | 0 |
| mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data | Feb 12, 2025 | cross-modal alignment, Large Language Model | Code Available | 2 |
| On Fairness of Unified Multimodal Large Language Model for Image Generation | Feb 5, 2025 | Fairness, Image Generation | Unverified | 0 |
| MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving | Feb 4, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Leveraging Multimodal LLM for Inspirational User Interface Search | Jan 29, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| Learning Free Token Reduction for Multi-Modal Large Language Models | Jan 29, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures | Jan 25, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding | Jan 25, 2025 | Action Understanding, Emotion Recognition | Unverified | 0 |
| EventVL: Understand Event Streams via Multimodal Large Language Model | Jan 23, 2025 | Event-based vision, Language Modeling | Unverified | 0 |
| VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Jan 21, 2025 | Image Generation, Instruction Following | Code Available | 3 |
| EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery | Jan 20, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis | Jan 17, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| Interpretable Droplet Digital PCR Assay for Trustworthy Molecular Diagnostics | Jan 16, 2025 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding | Jan 14, 2025 | Feature Compression, Language Modeling | Code Available | 2 |
| 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding | Jan 14, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Jan 14, 2025 | image-classification, Image Classification | Code Available | 2 |
| Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | Jan 14, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation | Jan 11, 2025 | Chart Understanding, Code Generation | Code Available | 2 |
| Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Jan 10, 2025 | Image Captioning, Language Modeling | Code Available | 3 |
| MinMo: A Multimodal Large Language Model for Seamless Voice Interaction | Jan 10, 2025 | Instruction Following, Language Modeling | Unverified | 0 |
| LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | Jan 9, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models | Jan 3, 2025 | Binary Classification, Face Anti-Spoofing | Unverified | 0 |
| GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model | Jan 1, 2025 | Attribute, Language Modeling | Unverified | 0 |
| Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering | Jan 1, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation | Jan 1, 2025 | Autonomous Driving, Autonomous Vehicles | Unverified | 0 |
| Beyond Text: Implementing Multimodal Large Language Model-Powered Multi-Agent Systems Using a No-Code Platform | Jan 1, 2025 | Code Generation, Image Generation | Unverified | 0 |
| ST^3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming | Dec 28, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| MLLM-SUL: Multimodal Large Language Model for Semantic Scene Understanding and Localization in Traffic Scenarios | Dec 27, 2024 | Autonomous Driving, Language Modeling | Code Available | 0 |
| A Large-scale Interpretable Multi-modality Benchmark for Facial Image Forgery Localization | Dec 27, 2024 | Face Swapping, Image Segmentation | Unverified | 0 |
| SubstationAI: Multimodal Large Model-Based Approaches for Analyzing Substation Equipment Faults | Dec 22, 2024 | Data Augmentation, Fault Diagnosis | Unverified | 0 |
| MiniGPT-Pancreas: Multimodal Large Language Model for Pancreas Cancer Classification and Detection | Dec 20, 2024 | Cancer Classification, Chatbot | Code Available | 1 |
| J-EDI QA: Benchmark for deep-sea organism-specific multimodal LLM | Dec 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering | Dec 19, 2024 | Contrastive Learning, Language Modeling | Code Available | 0 |
| Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation | Dec 17, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges | Dec 16, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| IDEA-Bench: How Far are Generative Models from Professional Designing? | Dec 16, 2024 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond | Dec 16, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM | Dec 12, 2024 | Image Comprehension, Image Generation | Unverified | 0 |
| Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine | Dec 12, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework | Dec 11, 2024 | GPU, Language Modeling | Unverified | 0 |
| DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation | Dec 10, 2024 | Image Generation, Language Modelling | Unverified | 0 |