| OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions | May 27, 2025 | Audio-Visual Synchronization, Conversational Response Generation | Unverified | 0 |
| Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes | May 26, 2025 | DeepFake Detection, Face Generation | Unverified | 0 |
| What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models | May 26, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval | May 26, 2025 | Image Retrieval, Large Language Model | Unverified | 0 |
| Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models | May 26, 2025 | image-classification, Image Classification | Code Available | 0 |
| OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model | May 25, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning | May 23, 2025 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning | May 22, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification | May 21, 2025 | Data Augmentation, Large Language Model | Unverified | 0 |
| Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval | May 21, 2025 | Attribute, Image Retrieval | Unverified | 0 |
| MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling | May 21, 2025 | Emotion Recognition, Face Detection | Unverified | 0 |
| UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation | May 20, 2025 | Image Generation, Language Modeling | Unverified | 0 |
| CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring | May 20, 2025 | Automated Essay Scoring, Diversity | Unverified | 0 |
| UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | May 20, 2025 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling | May 19, 2025 | Graph Generation, Knowledge Distillation | Unverified | 0 |
| MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | May 19, 2025 | Decoder, Image Generation | Code Available | 0 |
| Beyond Retrieval: Joint Supervision and Multimodal Document Ranking for Textbook Question Answering | May 17, 2025 | Document Ranking, Large Language Model | Unverified | 0 |
| Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning | May 10, 2025 | Image Augmentation, Large Language Model | Code Available | 0 |
| MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills | May 9, 2025 | Image Retouching, Large Language Model | Unverified | 0 |
| Is your multimodal large language model a good science tutor? | May 9, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| On Path to Multimodal Generalist: General-Level and General-Bench | May 7, 2025 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| Consistency-aware Fake Videos Detection on Short Video Platforms | Apr 30, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 0 |
| TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | Apr 24, 2025 | Caption Generation, Dense Video Captioning | Unverified | 0 |
| FaceInsight: A Multimodal Large Language Model for Face Perception | Apr 22, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| ChatEXAONEPath: An Expert-level Multimodal Large Language Model for Histopathology Using Whole Slide Images | Apr 17, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | Apr 14, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates | Apr 14, 2025 | Autonomous Navigation, Lane Detection | Unverified | 0 |
| Mavors: Multi-granularity Video Representation for Multimodal Large Language Model | Apr 14, 2025 | Computational Efficiency, Language Modeling | Unverified | 0 |
| Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment | Apr 10, 2025 | AI Agent, Attribute | Unverified | 0 |
| Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning | Apr 9, 2025 | Action Unit Detection, Age Estimation | Unverified | 0 |
| MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking | Apr 9, 2025 | Autonomous Driving, Language Modeling | Code Available | 0 |
| Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model | Apr 9, 2025 | Image Quality Assessment, Image Restoration | Unverified | 0 |
| Towards Visual Text Grounding of Multimodal Large Language Model | Apr 7, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Universal Item Tokenization for Transferable Generative Recommendation | Apr 6, 2025 | General Knowledge, Large Language Model | Unverified | 0 |
| Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities | Apr 2, 2025 | Descriptive, Large Language Model | Code Available | 0 |
| Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources | Apr 1, 2025 | GPU, Large Language Model | Unverified | 0 |
| Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training | Mar 31, 2025 | GPU, Language Modeling | Unverified | 0 |
| Dynamic Pyramid Network for Efficient Multimodal Large Language Model | Mar 26, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation | Mar 23, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | Mar 19, 2025 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| LEGION: Learning to Ground and Explain for Synthetic Image Detection | Mar 19, 2025 | Artifact Detection, Image Manipulation | Unverified | 0 |
| SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability | Mar 18, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model | Mar 17, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| When neural implant meets multimodal LLM: A dual-loop system for neuromodulation and naturalistic neuralbehavioral research | Mar 16, 2025 | EEG, Large Language Model | Unverified | 0 |
| GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | Mar 16, 2025 | Change Detection, Image Captioning | Unverified | 0 |
| OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning | Mar 14, 2025 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance | Mar 13, 2025 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| Hybrid Agents for Image Restoration | Mar 13, 2025 | Image Restoration, In-Context Learning | Unverified | 0 |
| Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition | Mar 10, 2025 | Disaster Response, Large Language Model | Unverified | 0 |
| PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks | Mar 6, 2025 | document understanding, Language Modeling | Code Available | 0 |