SOTAVerified

Image to text

Papers

Showing 26-50 of 246 papers

| Title | Status | Hype |
| --- | --- | --- |
| Magma: A Foundation Model for Multimodal AI Agents | Code | 5 |
| UNITE-FND: Reframing Multimodal Fake News Detection through Unimodal Scene Translation | | 0 |
| UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding | Code | 1 |
| Multi-LLM Collaborative Caption Generation in Scientific Documents | Code | 0 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Code | 1 |
| Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training | | 0 |
| Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | | 0 |
| PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval | Code | 0 |
| Survey on Abstractive Text Summarization: Dataset, Models, and Metrics | Code | 0 |
| CLIP-FSAC++: Few-Shot Anomaly Classification with Anomaly Descriptor Based on CLIP | Code | 0 |
| DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | | 0 |
| Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation | | 0 |
| FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training | Code | 1 |
| Everything is a Video: Unifying Modalities through Next-Frame Prediction | | 0 |
| Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models | | 0 |
| From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing | | 0 |
| Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization | | 0 |
| Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval | Code | 2 |
| Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs) | Code | 0 |
| Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics | | 0 |
| Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image | | 0 |
| An Online Learning Approach to Prompt-based Selection of Generative Models | | 0 |
| Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models | | 0 |
| Backdooring Vision-Language Models with Out-Of-Distribution Data | | 0 |
| See then Tell: Enhancing Key Information Extraction with Vision Grounding | | 0 |

No leaderboard results yet.