| MMMModal -- Multi-Images Multi-Audio Multi-turn Multi-Modal | Feb 17, 2024 | Language Modelling | Unverified | 0 |
| Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Feb 13, 2024 | Language Modelling, Large Language Model | Code Available | 2 |
| Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks | Feb 13, 2024 | Language Modelling | Unverified | 0 |
| Lumos : Empowering Multimodal LLMs with Scene Text Recognition | Feb 12, 2024 | Language Modelling | Unverified | 0 |
| LLaVA-Docent: Instruction Tuning with Multimodal Large Language Model to Support Art Appreciation Education | Feb 9, 2024 | Benchmarking, Chatbot | Unverified | 0 |
| Jailbreaking Attack against Multimodal Large Language Model | Feb 4, 2024 | Language Modelling | Code Available | 2 |
| GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Feb 4, 2024 | Language Modelling | Code Available | 2 |
| LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs | Jan 29, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion | Jan 24, 2024 | Conditional Image Generation, Denoising | Unverified | 0 |
| MLLMReID: Multimodal Large Language Model-based Person Re-identification | Jan 24, 2024 | Language Modelling | Unverified | 0 |