| BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification | Sep 9, 2023 | Image to text, Language Modeling | Unverified | 0 |
| Sequential Semantic Generative Communication for Progressive Text-to-Image Generation | Sep 8, 2023 | Image Generation, Image to text | Unverified | 0 |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | Sep 5, 2023 | Decoder, Image Generation | Code Available | 2 |
| Multimodal Foundation Models For Echocardiogram Interpretation | Aug 29, 2023 | Cross-Modal Retrieval, Diagnostic | Code Available | 1 |
| Beyond One-to-One: Rethinking the Referring Image Segmentation | Aug 26, 2023 | Decoder, Image Segmentation | Code Available | 1 |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | Aug 23, 2023 | Image Generation, Image to text | Code Available | 6 |
| GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training | Aug 22, 2023 | Image Classification | Unverified | 0 |
| Vision-Language Dataset Distillation | Aug 15, 2023 | Dataset Distillation, Image Classification | Code Available | 1 |
| Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval | Aug 8, 2023 | Cross-Modal Retrieval, Image Retrieval | Code Available | 1 |
| Multimodal Neurons in Pretrained Text-Only Transformers | Aug 3, 2023 | Image Captioning, Image to text | Unverified | 0 |