| Automated radiotherapy treatment planning guided by GPT-4Vision | Jun 21, 2024 | In-Context Learning, Language Modelling | Unverified | 0 |
| The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge | Jun 18, 2024 | Few-Shot Object Detection, Language Modeling | Unverified | 0 |
| TRINS: Towards Multimodal Language Models that Can Read | Jun 10, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak | May 30, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model | May 28, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation | May 27, 2024 | Instruction Following, Language Modeling | Unverified | 0 |
| V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM | May 24, 2024 | Language Modelling, Large Language Model | Code Available | 0 |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | May 23, 2024 | cross-modal alignment, Language Modelling | Unverified | 0 |
| Layout Generation Agents with Large Language Models | May 13, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | May 7, 2024 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Apr 25, 2024 | 4k, Language Modeling | Code Available | 0 |
| Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation | Apr 23, 2024 | Image Generation, Language Modeling | Unverified | 0 |
| RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models | Apr 18, 2024 | Fact Checking, Language Modeling | Unverified | 0 |
| GUIDE: Graphical User Interface Data for Execution | Apr 9, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security | Apr 8, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| SemGrasp: Semantic Grasp Generation via Language Aligned Discretization | Apr 4, 2024 | Grasp Generation, Language Modeling | Unverified | 0 |
| Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition | Mar 22, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| VL-Mamba: Exploring State Space Models for Multimodal Learning | Mar 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | Mar 13, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Multimodal Transformer for Comics Text-Cloze | Mar 6, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection | Mar 5, 2024 | Concept Alignment, Explanation Generation | Unverified | 0 |
| MIKO: Multimodal Intention Knowledge Distillation from Large Language Models for Social-Media Commonsense Discovery | Feb 28, 2024 | Knowledge Distillation, Language Modeling | Unverified | 0 |
| LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery | Feb 26, 2024 | Continual Learning, Exemplar-Free | Code Available | 0 |
| MMMModal -- Multi-Images Multi-Audio Multi-turn Multi-Modal | Feb 17, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks | Feb 13, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Lumos : Empowering Multimodal LLMs with Scene Text Recognition | Feb 12, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| LLaVA-Docent: Instruction Tuning with Multimodal Large Language Model to Support Art Appreciation Education | Feb 9, 2024 | Benchmarking, Chatbot | Unverified | 0 |
| LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs | Jan 29, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion | Jan 24, 2024 | Conditional Image Generation, Denoising | Unverified | 0 |
| MLLMReID: Multimodal Large Language Model-based Person Re-identification | Jan 24, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation | Jan 1, 2024 | Image Generation, Language Modeling | Unverified | 0 |
| ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation | Dec 24, 2023 | Common Sense Reasoning, Language Modeling | Unverified | 0 |
| Audio-Visual LLM for Video Understanding | Dec 11, 2023 | AudioCaps, Language Modeling | Unverified | 0 |
| EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model | Dec 5, 2023 | Boundary Detection, Language Modeling | Unverified | 0 |
| MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation | Dec 4, 2023 | Instruction Following, Language Modeling | Unverified | 0 |
| CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Nov 30, 2023 | Image Generation, In-Context Learning | Unverified | 0 |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | Nov 30, 2023 | Language Modeling, Language Modelling | Code Available | 0 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | Nov 25, 2023 | Instruction Following, Language Modeling | Unverified | 0 |
| How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model | Nov 10, 2023 | Image Captioning, Language Modeling | Unverified | 0 |
| Multimodal Large Language Model for Visual Navigation | Oct 12, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Comics for Everyone: Generating Accessible Text Descriptions for Comic Strips | Oct 1, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Investigating the Catastrophic Forgetting in Multimodal Large Language Models | Sep 19, 2023 | image-classification, Image Classification | Unverified | 0 |
| Imaginations of WALL-E : Reconstructing Experiences with an Imagination-Inspired Module for Advanced AI Systems | Aug 20, 2023 | Emotion Recognition, Language Modelling | Unverified | 0 |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | Jul 18, 2023 | Instruction Following, Language Modeling | Unverified | 0 |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Jul 4, 2023 | document understanding, Language Modeling | Code Available | 0 |
| A Survey on Multimodal Large Language Models | Jun 23, 2023 | Hallucination, In-Context Learning | Code Available | 0 |
| Language Is Not All You Need: Aligning Perception with Language Models | Feb 27, 2023 | All, Image Captioning | Code Available | 0 |