| Is Cross-modal Information Retrieval Possible without Training? | Apr 20, 2023 | Contrastive Learning, Cross-Modal Information Retrieval | Unverified | 0 |
| Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models | Mar 30, 2023 | Image to text, Prompt Learning | Unverified | 0 |
| CoBIT: A Contrastive Bi-directional Image-Text Generation Model | Mar 23, 2023 | Decoder, Image Generation | Unverified | 0 |
| MAGVLT: Masked Generative Vision-and-Language Transformer | Mar 21, 2023 | Image Captioning, Image Generation | Code Available | 1 |
| Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling | Mar 13, 2023 | Decoder, Image to text | Unverified | 0 |
| One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | Mar 12, 2023 | All, Image Generation | Code Available | 3 |
| ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation | Mar 11, 2023 | Image Captioning, Image to text | Code Available | 1 |
| An End-to-End Neural Network for Image-to-Audio Transformation | Mar 10, 2023 | Image to text, text-to-speech | Unverified | 0 |
| Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts | Feb 17, 2023 | Image Retrieval, Image-text Classification | Code Available | 1 |
| VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval | Feb 13, 2023 | Cross-Modal Information Retrieval, Cross-Modal Retrieval | Unverified | 0 |