| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | Sep 29, 2023 | Image to text, Passage Retrieval | Code Available | 2 |
| Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search | Sep 28, 2023 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 0 |
| SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution | Sep 25, 2023 | Image to text | Unverified | 0 |
| Offline Detection of Misspelled Handwritten Words by Convolving Recognition Model Features with Text Labels | Sep 18, 2023 | Generative Adversarial Network, Handwriting Recognition | Unverified | 0 |
| CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval | Sep 18, 2023 | Image to text, Person Retrieval | Code Available | 0 |
| BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification | Sep 9, 2023 | Image to text, Language Modeling | Unverified | 0 |
| Sequential Semantic Generative Communication for Progressive Text-to-Image Generation | Sep 8, 2023 | Image Generation, Image to text | Unverified | 0 |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | Sep 5, 2023 | Decoder, Image Generation | Code Available | 2 |
| Multimodal Foundation Models For Echocardiogram Interpretation | Aug 29, 2023 | Cross-Modal Retrieval, Diagnostic | Code Available | 1 |
| Beyond One-to-One: Rethinking the Referring Image Segmentation | Aug 26, 2023 | Decoder, Image Segmentation | Code Available | 1 |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | Aug 23, 2023 | Image Generation, Image to text | Code Available | 6 |
| GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training | Aug 22, 2023 | image-classification, Image Classification | Unverified | 0 |
| Vision-Language Dataset Distillation | Aug 15, 2023 | Dataset Distillation, image-classification | Code Available | 1 |
| Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval | Aug 8, 2023 | Cross-Modal Retrieval, Image Retrieval | Code Available | 1 |
| Multimodal Neurons in Pretrained Text-Only Transformers | Aug 3, 2023 | Image Captioning, Image to text | Unverified | 0 |
| Revisiting DETR Pre-training for Object Detection | Aug 2, 2023 | Image to text, Object | Unverified | 0 |
| Transferable Decoding with Visual Entities for Zero-Shot Image Captioning | Jul 31, 2023 | Caption Generation, Hallucination | Code Available | 1 |
| PRIOR: Prototype Representation Joint Learning from Medical Images and Reports | Jul 24, 2023 | Contrastive Learning, Image to text | Code Available | 1 |
| Towards a Visual-Language Foundation Model for Computational Pathology | Jul 24, 2023 | Contrastive Learning, image-classification | Unverified | 0 |
| Planting a SEED of Vision in Large Language Model | Jul 16, 2023 | Image Generation, Image to text | Code Available | 2 |
| PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting | Jul 14, 2023 | Cross-Modal Retrieval, Image to text | Unverified | 0 |
| Bootstrapping Vision-Language Learning with Decoupled Language Pre-training | Jul 13, 2023 | Image to text | Code Available | 1 |
| Emu: Generative Pretraining in Multimodality | Jul 11, 2023 | Image Captioning, Image Generation | Code Available | 3 |
| MultiQG-TI: Towards Question Generation from Multi-modal Sources | Jul 7, 2023 | Image to text, Optical Character Recognition | Code Available | 0 |
| Zero-shot Nuclei Detection via Visual-Language Pre-trained Models | Jun 30, 2023 | Image to text, object-detection | Code Available | 0 |