| Multimodal Procedural Planning via Dual Text-Image Prompting | May 2, 2023 | Image GenerationImage to text | CodeCode Available | 1 | 5 |
| What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs | Jun 19, 2022 | BenchmarkingImage Captioning | CodeCode Available | 1 | 5 |
| Vision-Language Dataset Distillation | Aug 15, 2023 | Dataset Distillationimage-classification | CodeCode Available | 1 | 5 |
| Multimodal Foundation Models For Echocardiogram Interpretation | Aug 29, 2023 | Cross-Modal RetrievalDiagnostic | CodeCode Available | 1 | 5 |
| See or Guess: Counterfactually Regularized Image Captioning | Aug 29, 2024 | Causal Inferencecounterfactual | CodeCode Available | 1 | 5 |
| L-Verse: Bidirectional Generation Between Image and Text | Nov 22, 2021 | Image CaptioningImage Generation | CodeCode Available | 1 | 5 |
| DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles | Mar 5, 2025 | Domain AdaptationImage to text | CodeCode Available | 1 | 5 |
| Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models | Nov 27, 2023 | Cross-Modal RetrievalImage Generation | CodeCode Available | 1 | 5 |
| FETA: Towards Specializing Foundation Models for Expert Task Applications | Sep 8, 2022 | Domain GeneralizationFew-Shot Learning | CodeCode Available | 1 | 5 |
| Write and Paint: Generative Vision-Language Models are Unified Modal Learners | Jun 15, 2022 | Image GenerationImage to text | CodeCode Available | 1 | 5 |
| Brain Captioning: Decoding human brain activity into images and text | May 19, 2023 | Brain DecodingDepth Estimation | CodeCode Available | 1 | 5 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Dec 16, 2021 | Image to textmodel | CodeCode Available | 1 | 5 |
| Can MLLMs Perform Text-to-Image In-Context Learning? | Feb 2, 2024 | Image GenerationImage to text | CodeCode Available | 1 | 5 |
| ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation | Dec 31, 2021 | Image CaptioningImage Generation | CodeCode Available | 1 | 5 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Feb 2, 2023 | AttributeFew-Shot Image Classification | CodeCode Available | 1 | 5 |
| Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | Nov 9, 2022 | Image GenerationImage to text | CodeCode Available | 1 | 5 |
| Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval | Aug 8, 2023 | Cross-Modal RetrievalImage Retrieval | CodeCode Available | 1 | 5 |
| FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training | Nov 18, 2024 | Data AugmentationImage to text | CodeCode Available | 1 | 5 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image CaptioningImage to text | CodeCode Available | 1 | 5 |
| Probing Multimodal Large Language Models for Global and Local Semantic Representations | Feb 27, 2024 | Image to textobject-detection | CodeCode Available | 0 | 5 |
| BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval | Jun 14, 2024 | Image RetrievalImage to text | CodeCode Available | 0 | 5 |
| Pragmatic Radiology Report Generation | Nov 28, 2023 | Image to text | CodeCode Available | 0 | 5 |
| PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval | Mar 20, 2025 | Contrastive LearningCross-Modal Retrieval | CodeCode Available | 0 | 5 |
| Multi-modality Regional Alignment Network for Covid X-Ray Survival Prediction and Report Generation | May 23, 2024 | Image to textSentence | CodeCode Available | 0 | 5 |
| MirrorGAN: Learning Text-to-image Generation by Redescription | Mar 14, 2019 | DiversityImage Generation | CodeCode Available | 0 | 5 |