| Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages | Nov 24, 2021 | Decoder, Image to text | Unverified | 0 |
| Vision-Braille: An End-to-End Tool for Chinese Braille Image-to-Text Translation | Jul 8, 2024 | Image to text, Lifelong learning | Unverified | 0 |
| Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation | Apr 30, 2024 | Caption Generation, Hallucination | Unverified | 0 |
| When are Lemons Purple? The Concept Association Bias of Vision-Language Models | Dec 22, 2022 | Attribute, image-classification | Unverified | 0 |
| X-Fusion: Introducing New Modality to Frozen Large Language Models | Apr 29, 2025 | Image to text | Unverified | 0 |
| 15M Multimodal Facial Image-Text Dataset | Jul 11, 2024 | Image to text | Unverified | 0 |
| RefineNet: Enhancing Text-to-Image Conversion with High-Resolution and Detail Accuracy through Hierarchical Transformers and Progressive Refinement | Dec 27, 2023 | Computational Efficiency, Image Generation | Unverified | 0 |
| Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API | Oct 7, 2023 | Decoder, document understanding | Unverified | 0 |
| Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags | Jun 16, 2024 | Image to text, Instruction Following | Unverified | 0 |
| Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | Jan 1, 2025 | image-classification, Image Classification | Unverified | 0 |
| Retrieval-Augmented Multimodal Language Modeling | Nov 22, 2022 | Caption Generation, Image Captioning | Unverified | 0 |
| PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval | Mar 20, 2025 | Contrastive Learning, Cross-Modal Retrieval | Code Available | 0 |
| PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval | Jan 1, 2025 | Contrastive Learning, Image Retrieval | Code Available | 0 |
| Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models | Feb 18, 2025 | Image to text, Optical Character Recognition | Code Available | 0 |
| Real-world validation of a multimodal LLM-powered pipeline for High-Accuracy Clinical Trial Patient Matching leveraging EHR data | Mar 19, 2025 | Image to text | Code Available | 0 |
| MirrorGAN: Learning Text-to-image Generation by Redescription | Mar 14, 2019 | Diversity, Image Generation | Code Available | 0 |
| CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval | Sep 18, 2023 | Image to text, Person Retrieval | Code Available | 0 |
| Characterizing and Understanding the Behavior of Quantized Models for Reliable Deployment | Apr 8, 2022 | Image to text, Language Modeling | Code Available | 0 |
| Probing Multimodal Large Language Models for Global and Local Semantic Representations | Feb 27, 2024 | Image to text, object-detection | Code Available | 0 |
| UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings | May 17, 2025 | Image to text, Information Retrieval | Code Available | 0 |
| Delving into the Openness of CLIP | Jun 4, 2022 | image-classification, Image Classification | Code Available | 0 |
| Text-to-Image-to-Text Translation using Cycle Consistent Adversarial Networks | Aug 14, 2018 | Image to text, Sentence | Code Available | 0 |
| Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs) | Oct 25, 2024 | Attribute, Image to text | Code Available | 0 |
| Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search | Sep 28, 2023 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 0 |
| Adaptively Clustering Neighbor Elements for Image-Text Generation | Jan 5, 2023 | Clustering, Decoder | Code Available | 0 |