| Paint by Inpaint: Learning to Add Image Objects by Removing Them First | Apr 28, 2024 | Image Inpainting, Language Modeling | Code Available | 2 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Apr 25, 2024 | 4k, Language Modeling | Unverified | 0 |
| Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation | Apr 23, 2024 | Image Generation, Language Modeling | Unverified | 0 |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Apr 19, 2024 | Language Modeling, Language Modelling | Code Available | 4 |
| RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models | Apr 18, 2024 | Fact Checking, Language Modeling | Unverified | 0 |
| Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification | Apr 16, 2024 | Feature Engineering, Language Modeling | Code Available | 3 |
| LaVy: Vietnamese Multimodal Large Language Model | Apr 11, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| UMBRAE: Unified Multimodal Brain Decoding | Apr 10, 2024 | Brain Decoding, Language Modeling | Code Available | 2 |
| GUIDE: Graphical User Interface Data for Execution | Apr 9, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security | Apr 8, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation | Apr 8, 2024 | Image Generation, Image-to-Image Translation | Code Available | 3 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Apr 4, 2024 | Language Modeling, Language Modelling | Code Available | 4 |
| SemGrasp: Semantic Grasp Generation via Language Aligned Discretization | Apr 4, 2024 | Grasp Generation, Language Modeling | Unverified | 0 |
| LITE: Modeling Environmental Ecosystems with Multimodal Large Language Models | Apr 1, 2024 | Decision Making, Language Modeling | Code Available | 1 |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | Mar 29, 2024 | Instruction Following, Language Modelling | Code Available | 2 |
| Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition | Mar 22, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| VL-Mamba: Exploring State Space Models for Multimodal Learning | Mar 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | Mar 13, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Mar 7, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 2 |
| Multimodal Transformer for Comics Text-Cloze | Mar 6, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception | Mar 5, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection | Mar 5, 2024 | Concept Alignment, Explanation Generation | Unverified | 0 |
| MIKO: Multimodal Intention Knowledge Distillation from Large Language Models for Social-Media Commonsense Discovery | Feb 28, 2024 | Knowledge Distillation, Language Modeling | Unverified | 0 |
| ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Feb 27, 2024 | 3D geometry, 3D Object Captioning | Code Available | 3 |
| LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery | Feb 26, 2024 | Continual Learning, Exemplar-Free | Code Available | 0 |