| GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models | Jul 30, 2024 | Image to text, Image-to-Text Retrieval | Code Available | 0 | 5 |
| Adaptively Clustering Neighbor Elements for Image-Text Generation | Jan 5, 2023 | Clustering, Decoder | Code Available | 0 | 5 |
| Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search | Sep 28, 2023 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 0 | 5 |
| Survey on Abstractive Text Summarization: Dataset, Models, and Metrics | Dec 22, 2024 | Abstractive Text Summarization, General Knowledge | Code Available | 0 | 5 |
| Multi-LLM Collaborative Caption Generation in Scientific Documents | Jan 5, 2025 | Caption Generation, Image to text | Code Available | 0 | 5 |
| UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings | May 17, 2025 | Image to text, Information Retrieval | Code Available | 0 | 5 |
| BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval | Jun 14, 2024 | Image Retrieval, Image to text | Code Available | 0 | 5 |
| Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization | Oct 30, 2024 | Image to text, Image-to-Text Retrieval | Unverified | 0 | 0 |
| Robustifying Vision-Language Models via Dynamic Token Reweighting | May 22, 2025 | Image to text | Unverified | 0 | 0 |
| See then Tell: Enhancing Key Information Extraction with Vision Grounding | Sep 29, 2024 | Image to text, Key Information Extraction | Unverified | 0 | 0 |
| SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs | Apr 17, 2025 | Cross-Modal Retrieval, Image Retrieval | Unverified | 0 | 0 |
| Sequential Semantic Generative Communication for Progressive Text-to-Image Generation | Sep 8, 2023 | Image Generation, Image to text | Unverified | 0 | 0 |
| SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing | Oct 12, 2023 | Image Generation, Image to text | Unverified | 0 | 0 |
| SLAN: Self-Locator Aided Network for Cross-Modal Understanding | Nov 28, 2022 | Image Retrieval, Image to text | Unverified | 0 | 0 |
| SLAN: Self-Locator Aided Network for Vision-Language Understanding | Jan 1, 2023 | Image Retrieval, Image to text | Unverified | 0 | 0 |
| SRCB at SemEval-2022 Task 5: Pretraining Based Image to Text Late Sequential Fusion System for Multimodal Misogynous Meme Identification | Jul 1, 2022 | Image to text | Unverified | 0 | 0 |
| SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution | Sep 25, 2023 | Image to text | Unverified | 0 | 0 |
| Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval | May 16, 2021 | Graph Generation, Image Captioning | Unverified | 0 | 0 |
| SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment | Jan 4, 2024 | Image Captioning, image-classification | Unverified | 0 | 0 |
| Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image | Oct 20, 2024 | Image to text | Unverified | 0 | 0 |
| Synthesizing Novel Pairs of Image and Text | Dec 18, 2017 | Image to text | Unverified | 0 | 0 |
| Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models | Mar 30, 2023 | Image to text, Prompt Learning | Unverified | 0 | 0 |
| TMCIR: Token Merge Benefits Composed Image Retrieval | Apr 15, 2025 | Contrastive Learning, cross-modal alignment | Unverified | 0 | 0 |
| TNG-CLIP:Training-Time Negation Data Generation for Negation Awareness of CLIP | May 24, 2025 | Image Captioning, Image Generation | Unverified | 0 | 0 |
| Towards a Visual-Language Foundation Model for Computational Pathology | Jul 24, 2023 | Contrastive Learning, image-classification | Unverified | 0 | 0 |
| Transform-Retrieve-Generate: Natural Language-Centric Outside-Knowledge Visual Question Answering | Jan 1, 2022 | Generative Question Answering, Image to text | Unverified | 0 | 0 |
| TrojVLM: Backdoor Attack Against Vision Language Models | Sep 28, 2024 | Backdoor Attack, Image Captioning | Unverified | 0 | 0 |
| Turbo Learning for Captionbot and Drawingbot | May 21, 2018 | Image Captioning, Image Generation | Unverified | 0 | 0 |
| Two-stream Hierarchical Similarity Reasoning for Image-text Matching | Mar 10, 2022 | Image-text matching, Image to text | Unverified | 0 | 0 |
| Uncertainty-based Cross-Modal Retrieval with Probabilistic Representations | Apr 20, 2022 | Cross-Modal Retrieval, Image Retrieval | Unverified | 0 | 0 |
| Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning | May 26, 2024 | Image to text, Image-to-Text Retrieval | Unverified | 0 | 0 |
| UNITE-FND: Reframing Multimodal Fake News Detection through Unimodal Scene Translation | Feb 16, 2025 | Binary Classification, Fake News Detection | Unverified | 0 | 0 |
| Using Inter-Sentence Diverse Beam Search to Reduce Redundancy in Visual Storytelling | May 30, 2018 | Image to text, Sentence | Unverified | 0 | 0 |
| Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages | Nov 24, 2021 | Decoder, Image to text | Unverified | 0 | 0 |
| Vision-Braille: An End-to-End Tool for Chinese Braille Image-to-Text Translation | Jul 8, 2024 | Image to text, Lifelong learning | Unverified | 0 | 0 |
| Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation | Apr 30, 2024 | Caption Generation, Hallucination | Unverified | 0 | 0 |
| When are Lemons Purple? The Concept Association Bias of Vision-Language Models | Dec 22, 2022 | Attribute, image-classification | Unverified | 0 | 0 |
| X-Fusion: Introducing New Modality to Frozen Large Language Models | Apr 29, 2025 | Image to text | Unverified | 0 | 0 |
| 15M Multimodal Facial Image-Text Dataset | Jul 11, 2024 | Image to text | Unverified | 0 | 0 |
| Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning | Oct 12, 2023 | Image Captioning, Image-text Retrieval | Unverified | 0 | 0 |
| Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution | May 16, 2025 | Cross-Modal Retrieval, Image to text | Unverified | 0 | 0 |
| ABC: Achieving Better Control of Multimodal Embeddings using VLMs | Mar 1, 2025 | Image to text, Image-to-Text Retrieval | Unverified | 0 | 0 |
| Accept the Modality Gap: An Exploration in the Hyperbolic Space | Jan 1, 2024 | Image to text, Image-to-Text Retrieval | Unverified | 0 | 0 |
| Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training | Jan 1, 2025 | Image-text Retrieval, Image to text | Unverified | 0 | 0 |
| AICoderEval: Improving AI Domain Code Generation of Large Language Models | Jun 7, 2024 | Code Generation, Image to text | Unverified | 0 | 0 |
| AI Recommendation System for Enhanced Customer Experience: A Novel Image-to-Text Method | Nov 16, 2023 | Image to text, Object | Unverified | 0 | 0 |
| An End-to-End Neural Network for Image-to-Audio Transformation | Mar 10, 2023 | Image to text, text-to-speech | Unverified | 0 | 0 |
| An Online Learning Approach to Prompt-based Selection of Generative Models | Oct 17, 2024 | Image to text | Unverified | 0 | 0 |
| Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models | Aug 16, 2024 | Image to text | Unverified | 0 | 0 |
| A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering | Jan 14, 2022 | Generative Question Answering, Image to text | Unverified | 0 | 0 |