| DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles | Mar 5, 2025 | Domain Adaptation, Image to text | Code Available | 1 |
| ABC: Achieving Better Control of Multimodal Embeddings using VLMs | Mar 1, 2025 | Image to text, Image-to-Text Retrieval | Unverified | 0 |
| On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation | Feb 26, 2025 | Cross-Modal Retrieval, Hallucination | Unverified | 0 |
| Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models | Feb 18, 2025 | Image to text, Optical Character Recognition | Code Available | 0 |
| Natural Language Generation from Visual Sequences: Challenges and Future Directions | Feb 18, 2025 | Image to text, Text Generation | Unverified | 0 |
| Magma: A Foundation Model for Multimodal AI Agents | Feb 18, 2025 | Autonomous Web Navigation, Image to text | Code Available | 5 |
| UNITE-FND: Reframing Multimodal Fake News Detection through Unimodal Scene Translation | Feb 16, 2025 | Binary Classification, Fake News Detection | Unverified | 0 |
| UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding | Feb 8, 2025 | Denoising, Image Generation | Code Available | 1 |
| Multi-LLM Collaborative Caption Generation in Scientific Documents | Jan 5, 2025 | Caption Generation, Image to text | Code Available | 0 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image Captioning, Image to text | Code Available | 1 |