| CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model | Jun 16, 2025 | Decision Making, Financial Analysis | —Unverified | 0 | 0 |
| ChatEXAONEPath: An Expert-level Multimodal Large Language Model for Histopathology Using Whole Slide Images | Apr 17, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| ChatGPT Meets Iris Biometrics | Aug 9, 2024 | Face Recognition, Iris Recognition | —Unverified | 0 | 0 |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | Jul 18, 2023 | Instruction Following, Language Modeling | —Unverified | 0 | 0 |
| ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model | Nov 4, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI | Jul 14, 2025 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 | 0 |
| CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance | Mar 13, 2025 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 | 0 |
| CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates | Apr 14, 2025 | Autonomous Navigation, Lane Detection | —Unverified | 0 | 0 |
| CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering | Mar 1, 2025 | Continual Learning, Language Modeling | —Unverified | 0 | 0 |
| CLSP: High-Fidelity Contrastive Language-State Pre-training for Agent State Representation | Sep 24, 2024 | Contrastive Learning, Language Modeling | —Unverified | 0 | 0 |
| CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Nov 30, 2023 | Image Generation, In-Context Learning | —Unverified | 0 | 0 |
| CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation | Jan 1, 2024 | Image Generation, Language Modeling | —Unverified | 0 | 0 |
| COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework | Dec 11, 2024 | GPU, Language Modeling | —Unverified | 0 | 0 |
| Comics for Everyone: Generating Accessible Text Descriptions for Comic Strips | Oct 1, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step | Jul 6, 2025 | Denoising, Large Language Model | —Unverified | 0 | 0 |
| CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model | Nov 19, 2024 | Information Retrieval, Language Modeling | —Unverified | 0 | 0 |
| Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic | Jul 25, 2024 | Image to text, Language Modeling | —Unverified | 0 | 0 |
| Decoding Style: Efficient Fine-Tuning of LLMs for Image-Guided Outfit Recommendation with Preference | Sep 18, 2024 | Image Captioning, Large Language Model | —Unverified | 0 | 0 |
| DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation | Dec 10, 2024 | Image Generation, Language Modelling | —Unverified | 0 | 0 |
| Distraction is All You Need for Multimodal Large Language Model Jailbreaking | Feb 15, 2025 | All, Language Modeling | —Unverified | 0 | 0 |
| DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing | Sep 2, 2024 | Image Generation, Language Modelling | —Unverified | 0 | 0 |
| DreamJourney: Perpetual View Generation with Video Diffusion Models | Jun 21, 2025 | Image to 3D, Large Language Model | —Unverified | 0 | 0 |
| DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation | Dec 4, 2024 | Image Generation, Large Language Model | —Unverified | 0 | 0 |
| EAGLE: Egocentric AGgregated Language-video Engine | Sep 26, 2024 | Action Recognition, Activity Recognition | —Unverified | 0 | 0 |
| EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM | Dec 12, 2024 | Image Comprehension, Image Generation | —Unverified | 0 | 0 |
| EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM | Dec 5, 2024 | Image Manipulation, Language Modeling | —Unverified | 0 | 0 |
| EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model | Aug 21, 2024 | Computational Efficiency, Language Modeling | —Unverified | 0 | 0 |
| Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak | May 30, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios | Dec 5, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model | Dec 5, 2023 | Boundary Detection, Language Modeling | —Unverified | 0 | 0 |
| EventVL: Understand Event Streams via Multimodal Large Language Model | Jan 23, 2025 | Event-based vision, Language Modeling | —Unverified | 0 | 0 |
| FaceInsight: A Multimodal Large Language Model for Face Perception | Apr 22, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning | Apr 9, 2025 | Action Unit Detection, Age Estimation | —Unverified | 0 | 0 |
| Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms | Oct 24, 2024 | Diversity, Language Modeling | —Unverified | 0 | 0 |
| ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization | Oct 14, 2024 | Explanation Generation, Image Forgery Detection | —Unverified | 0 | 0 |
| From Street Views to Urban Science: Discovering Road Safety Factors with Multimodal Large Language Models | Jun 2, 2025 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 | 0 |
| GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing | Jul 8, 2024 | Image Generation, Language Modeling | —Unverified | 0 | 0 |
| GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | Mar 16, 2025 | Change Detection, Image Captioning | —Unverified | 0 | 0 |
| Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders | Feb 18, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 | 0 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | Nov 25, 2023 | Instruction Following, Language Modeling | —Unverified | 0 | 0 |
| Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models | Jul 26, 2024 | Disentanglement, Language Modeling | —Unverified | 0 | 0 |
| GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model | Jan 1, 2025 | Attribute, Language Modeling | —Unverified | 0 | 0 |
| Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes | May 26, 2025 | DeepFake Detection, Face Generation | —Unverified | 0 | 0 |
| Guardrails for avoiding harmful medical product recommendations and off-label promotion in generative AI models | Jun 24, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| GUIDE: Graphical User Interface Data for Execution | Apr 9, 2024 | Language Modelling, Large Language Model | —Unverified | 0 | 0 |
| Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition | Mar 22, 2024 | Language Modelling, Large Language Model | —Unverified | 0 | 0 |
| HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model | Mar 17, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval | May 21, 2025 | Attribute, Image Retrieval | —Unverified | 0 | 0 |
| HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning | May 23, 2025 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 | 0 |
| How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model | Nov 10, 2023 | Image Captioning, Language Modeling | —Unverified | 0 | 0 |