| Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model | May 28, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| SCA: Improve Semantic Consistent in Unrestricted Adversarial Attacks via DDPM Inversion | Oct 3, 2024 | Adversarial Attack, Denoising | Code Available | 0 | 5 |
| Leveraging Multimodal LLM for Inspirational User Interface Search | Jan 29, 2025 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | Code Available | 0 | 5 |
| Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts | Nov 18, 2024 | Benchmarking, Multimodal Large Language Model | Code Available | 0 | 5 |
| Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning | Nov 17, 2024 | Image Captioning, Language Modeling | Code Available | 0 | 5 |
| TourSynbio-Search: A Large Language Model Driven Agent Framework for Unified Search Method for Protein Engineering | Nov 9, 2024 | Information Retrieval, Language Modeling | Code Available | 0 | 5 |
| PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks | Mar 6, 2025 | document understanding, Language Modeling | Code Available | 0 | 5 |
| Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation | May 28, 2025 | Image Generation, Language Modeling | Code Available | 0 | 5 |
| MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | May 19, 2025 | Decoder, Image Generation | Code Available | 0 | 5 |
| LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery | Feb 26, 2024 | Continual Learning, Exemplar-Free | Code Available | 0 | 5 |
| Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations | Oct 22, 2024 | Camouflaged Object Segmentation, Large Language Model | Code Available | 0 | 5 |
| Visual Text Generation in the Wild | Jul 19, 2024 | Language Modelling, Large Language Model | Code Available | 0 | 5 |
| Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities | Apr 2, 2025 | Descriptive, Large Language Model | Code Available | 0 | 5 |
| Layout Generation Agents with Large Language Models | May 13, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| TRINS: Towards Multimodal Language Models that Can Read | Jun 10, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | May 20, 2025 | Large Language Model, Multimodal Large Language Model | Unverified | 0 | 0 |
| UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | Mar 19, 2025 | Language Model Evaluation, Language Modeling | Unverified | 0 | 0 |
| VGR: Visual Grounded Reasoning | Jun 13, 2025 | Large Language Model, Math | Unverified | 0 | 0 |
| Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model | Aug 21, 2024 | Emotion Recognition, Language Modeling | Unverified | 0 | 0 |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | May 7, 2024 | Large Language Model, Multimodal Large Language Model | Unverified | 0 | 0 |
| Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese | Aug 22, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks | Feb 13, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation | Oct 11, 2024 | Diagnostic, Language Modeling | Unverified | 0 | 0 |
| VL-Mamba: Exploring State Space Models for Multimodal Learning | Mar 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection | Sep 30, 2024 | Anomaly Detection, Language Modeling | Unverified | 0 | 0 |
| VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks | Jul 29, 2024 | Deep Learning, Domain Generalization | Unverified | 0 | 0 |
| Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach | Oct 31, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models | May 26, 2025 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| When neural implant meets multimodal LLM: A dual-loop system for neuromodulation and naturalistic neuralbehavioral research | Mar 16, 2025 | EEG, Large Language Model | Unverified | 0 | 0 |
| WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image | Dec 3, 2024 | Diagnostic, Language Modeling | Unverified | 0 | 0 |
| Multimodal large language model for wheat breeding: a new exploration of smart breeding | Nov 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| A Large-scale Interpretable Multi-modality Benchmark for Facial Image Forgery Localization | Dec 27, 2024 | Face Swapping, Image Segmentation | Unverified | 0 | 0 |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | May 23, 2024 | cross-modal alignment, Language Modelling | Unverified | 0 | 0 |
| A Medical Multimodal Large Language Model for Pediatric Pneumonia | Sep 4, 2024 | Diagnostic, Language Modeling | Unverified | 0 | 0 |
| A Neural Matrix Decomposition Recommender System Model based on the Multimodal Large Language Model | Jul 12, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions | Jun 4, 2025 | Data Augmentation, Diversity | Unverified | 0 | 0 |
| ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM | Jun 17, 2025 | Hallucination, Language Modeling | Unverified | 0 | 0 |
| A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges | Dec 16, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| Audio-Visual LLM for Video Understanding | Dec 11, 2023 | AudioCaps, Language Modeling | Unverified | 0 | 0 |
| Automated radiotherapy treatment planning guided by GPT-4Vision | Jun 21, 2024 | In-Context Learning, Language Modelling | Unverified | 0 | 0 |
| Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction | Sep 2, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| Beyond Retrieval: Joint Supervision and Multimodal Document Ranking for Textbook Question Answering | May 17, 2025 | Document Ranking, Large Language Model | Unverified | 0 | 0 |
| Beyond Text: Implementing Multimodal Large Language Model-Powered Multi-Agent Systems Using a No-Code Platform | Jan 1, 2025 | Code Generation, Image Generation | Unverified | 0 | 0 |
| BlueLM-2.5-3B Technical Report | Jul 8, 2025 | Large Language Model, Multimodal Large Language Model | Unverified | 0 | 0 |
| CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches | Sep 26, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring | May 20, 2025 | Automated Essay Scoring, Diversity | Unverified | 0 | 0 |
| Can Multimodal Large Language Model Think Analogically? | Nov 2, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models | Nov 11, 2024 | 2D Pose Estimation, Category-Agnostic Pose Estimation | Unverified | 0 | 0 |
| CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion | Aug 21, 2024 | Language Modelling, Large Language Model | Unverified | 0 | 0 |