| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | May 14, 2024 | Image Generation, Language Modeling | Code Available | 7 |
| Layout Generation Agents with Large Language Models | May 13, 2024 | Language Modeling | Code Available | 0 |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | May 7, 2024 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing | May 7, 2024 | Image Manipulation, Language Modeling | Code Available | 4 |
| WorldGPT: Empowering LLM as Multimodal World Model | Apr 28, 2024 | Language Modeling | Code Available | 2 |
| Paint by Inpaint: Learning to Add Image Objects by Removing Them First | Apr 28, 2024 | Image Inpainting, Language Modeling | Code Available | 2 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Apr 25, 2024 | 4k, Language Modeling | Unverified | 0 |
| Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation | Apr 23, 2024 | Image Generation, Language Modeling | Unverified | 0 |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Apr 19, 2024 | Language Modeling | Code Available | 4 |
| RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models | Apr 18, 2024 | Fact Checking, Language Modeling | Unverified | 0 |