| UrbanWorld: An Urban World Model for 3D City Generation | Jul 16, 2024 | Decision Making, Language Modelling | Code Available | 2 |
| Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM | Jun 18, 2024 | Anomaly Detection, Anomaly Localization | Code Available | 2 |
| Explore the Limits of Omni-modal Pretraining at Scale | Jun 13, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| A Survey of Multimodal Large Language Model from A Data-centric Perspective | May 26, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| WorldGPT: Empowering LLM as Multimodal World Model | Apr 28, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| Paint by Inpaint: Learning to Add Image Objects by Removing Them First | Apr 28, 2024 | Image Inpainting, Language Modeling | Code Available | 2 |
| LaVy: Vietnamese Multimodal Large Language Model | Apr 11, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| UMBRAE: Unified Multimodal Brain Decoding | Apr 10, 2024 | Brain Decoding, Language Modeling | Code Available | 2 |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | Mar 29, 2024 | Instruction Following, Language Modelling | Code Available | 2 |
| CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Mar 7, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 2 |
| Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Feb 13, 2024 | Language Modelling, Large Language Model | Code Available | 2 |
| Jailbreaking Attack against Multimodal Large Language Model | Feb 4, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Feb 4, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning | Jan 19, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | Jan 1, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 |
| LLMGA: Multimodal Large Language Model based Generation Assistant | Nov 27, 2023 | Image Generation, Language Modeling | Code Available | 2 |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Jun 23, 2023 | Benchmarking, Language Modeling | Code Available | 2 |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Jun 7, 2023 | Cross-Modal Retrieval, Language Modelling | Code Available | 2 |
| MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis | Jun 23, 2025 | Diagnostic, Large Language Model | Code Available | 1 |
| The Condition Number as a Scale-Invariant Proxy for Information Encoding in Neural Units | Jun 19, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| un^2CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP | May 30, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model | May 30, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution | May 27, 2025 | 8k, Avg | Code Available | 1 |
| Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging | May 26, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion | May 26, 2025 | Denoising, Image Generation | Code Available | 1 |
| ChemMLLM: Chemical Multimodal Large Language Model | May 22, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation | May 19, 2025 | Binary Classification, DeepFake Detection | Code Available | 1 |
| Unifying Segment Anything in Microscopy with Multimodal Large Language Model | May 16, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding | Apr 17, 2025 | Image Generation, Large Language Model | Code Available | 1 |
| AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection | Apr 16, 2025 | Anomaly Detection, Large Language Model | Code Available | 1 |
| Enhancing Time Series Forecasting via Multi-Level Text Alignment with LLMs | Apr 10, 2025 | Multimodal Large Language Model, Time Series | Code Available | 1 |
| Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions | Mar 20, 2025 | 2D Object Detection, Distributed Computing | Code Available | 1 |
| Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space | Mar 14, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| Towards General Visual-Linguistic Face Forgery Detection(V2) | Feb 28, 2025 | Hallucination, Language Modeling | Code Available | 1 |
| Towards Text-Image Interleaved Retrieval | Feb 18, 2025 | Information Retrieval, Language Modeling | Code Available | 1 |
| PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures | Jan 25, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery | Jan 20, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis | Jan 17, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding | Jan 14, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering | Jan 1, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| MiniGPT-Pancreas: Multimodal Large Language Model for Pancreas Cancer Classification and Detection | Dec 20, 2024 | Cancer Classification, Chatbot | Code Available | 1 |
| IDEA-Bench: How Far are Generative Models from Professional Designing? | Dec 16, 2024 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations | Dec 9, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model, Video Understanding | Code Available | 1 |
| Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning | Nov 18, 2024 | Attribute, Compositional Zero-Shot Learning | Code Available | 1 |
| Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model | Nov 16, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Meaning Typed Prompting: A Technique for Efficient, Reliable Structured Output Generation | Oct 22, 2024 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation | Oct 17, 2024 | Decision Making, Language Modeling | Code Available | 1 |
| Hespi: A pipeline for automatically detecting information from hebarium specimen sheets | Oct 11, 2024 | Handwritten Text Recognition, HTR | Code Available | 1 |