| Zero-shot Nuclei Detection via Visual-Language Pre-trained Models | Jun 30, 2023 | Image to text, object-detection | Code Available | 0 | 5 |
| GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models | Jul 30, 2024 | Image to text, Image-to-Text Retrieval | Code Available | 0 | 5 |
| Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search | Sep 28, 2023 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 0 | 5 |
| Face2Text: Collecting an Annotated Image Description Corpus for the Generation of Rich Face Descriptions | Mar 10, 2018 | Image Description, Image to text | Code Available | 0 | 5 |
| Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags | Jun 16, 2024 | Image to text, Instruction Following | —Unverified | 0 | 0 |
| Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | Jan 1, 2025 | image-classification, Image Classification | —Unverified | 0 | 0 |
| Retrieval-Augmented Multimodal Language Modeling | Nov 22, 2022 | Caption Generation, Image Captioning | —Unverified | 0 | 0 |
| Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning | Feb 9, 2023 | Few-Shot Learning, Image Captioning | —Unverified | 0 | 0 |
| Revisiting DETR Pre-training for Object Detection | Aug 2, 2023 | Image to text, Object | —Unverified | 0 | 0 |
| Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization | Sep 26, 2024 | Image to text, Image-to-Text Retrieval | —Unverified | 0 | 0 |
| Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization | Oct 30, 2024 | Image to text, Image-to-Text Retrieval | —Unverified | 0 | 0 |
| Robustifying Vision-Language Models via Dynamic Token Reweighting | May 22, 2025 | Image to text | —Unverified | 0 | 0 |
| See then Tell: Enhancing Key Information Extraction with Vision Grounding | Sep 29, 2024 | Image to text, Key Information Extraction | —Unverified | 0 | 0 |
| SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs | Apr 17, 2025 | Cross-Modal Retrieval, Image Retrieval | —Unverified | 0 | 0 |
| Sequential Semantic Generative Communication for Progressive Text-to-Image Generation | Sep 8, 2023 | Image Generation, Image to text | —Unverified | 0 | 0 |
| SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing | Oct 12, 2023 | Image Generation, Image to text | —Unverified | 0 | 0 |
| SLAN: Self-Locator Aided Network for Cross-Modal Understanding | Nov 28, 2022 | Image Retrieval, Image to text | —Unverified | 0 | 0 |
| SLAN: Self-Locator Aided Network for Vision-Language Understanding | Jan 1, 2023 | Image Retrieval, Image to text | —Unverified | 0 | 0 |
| SRCB at SemEval-2022 Task 5: Pretraining Based Image to Text Late Sequential Fusion System for Multimodal Misogynous Meme Identification | Jul 1, 2022 | Image to text | —Unverified | 0 | 0 |
| SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution | Sep 25, 2023 | Image to text | —Unverified | 0 | 0 |
| Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval | May 16, 2021 | Graph Generation, Image Captioning | —Unverified | 0 | 0 |
| SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment | Jan 4, 2024 | Image Captioning, image-classification | —Unverified | 0 | 0 |
| Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image | Oct 20, 2024 | Image to text | —Unverified | 0 | 0 |
| Synthesizing Novel Pairs of Image and Text | Dec 18, 2017 | Image to text | —Unverified | 0 | 0 |
| Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models | Mar 30, 2023 | Image to text, Prompt Learning | —Unverified | 0 | 0 |