SOTAVerified

Image to text

Papers

Showing 51–100 of 246 papers

Title | Status | Hype
Multimodal Procedural Planning via Dual Text-Image Prompting | Code | 1
MAGVLT: Masked Generative Vision-and-Language Transformer | Code | 1
ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation | Code | 1
Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts | Code | 1
Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Code | 1
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | Code | 1
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation | Code | 1
Linearly Mapping from Image to Text Space | Code | 1
FETA: Towards Specializing Foundation Models for Expert Task Applications | Code | 1
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs | Code | 1
Write and Paint: Generative Vision-Language Models are Unified Modal Learners | Code | 1
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation | Code | 1
Distilled Dual-Encoder Model for Vision-Language Understanding | Code | 1
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic | Code | 1
L-Verse: Bidirectional Generation Between Image and Text | Code | 1
Unifying Multimodal Transformer for Bi-directional Image and Text Generation | Code | 1
Concadia: Towards Image-Based Text Generation with a Purpose | Code | 1
Progressive Transformer-Based Generation of Radiology Reports | Code | 1
Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation | Code | 1
Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | - | 0
ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering | - | 0
TNG-CLIP:Training-Time Negation Data Generation for Negation Awareness of CLIP | - | 0
BRIT: Bidirectional Retrieval over Unified Image-Text Graph | - | 0
Robustifying Vision-Language Models via Dynamic Token Reweighting | - | 0
UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings | Code | 0
Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution | - | 0
X-Fusion: Introducing New Modality to Frozen Large Language Models | - | 0
SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs | - | 0
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation | - | 0
TMCIR: Token Merge Benefits Composed Image Retrieval | - | 0
Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module | - | 0
Natural Language Generation | - | 0
PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval | Code | 0
Real-world validation of a multimodal LLM-powered pipeline for High-Accuracy Clinical Trial Patient Matching leveraging EHR data | Code | 0
MFP-CLIP: Exploring the Efficacy of Multi-Form Prompts for Zero-Shot Industrial Anomaly Detection | - | 0
ABC: Achieving Better Control of Multimodal Embeddings using VLMs | - | 0
On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation | - | 0
Natural Language Generation from Visual Sequences: Challenges and Future Directions | - | 0
Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models | Code | 0
UNITE-FND: Reframing Multimodal Fake News Detection through Unimodal Scene Translation | - | 0
Multi-LLM Collaborative Caption Generation in Scientific Documents | Code | 0
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training | - | 0
Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | - | 0
PromptHash:Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval | Code | 0
Survey on Abstractive Text Summarization: Dataset, Models, and Metrics | Code | 0
CLIP-FSAC++: Few-Shot Anomaly Classification with Anomaly Descriptor Based on CLIP | Code | 0
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | - | 0
Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation | - | 0
Everything is a Video: Unifying Modalities through Next-Frame Prediction | - | 0
Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models | - | 0

No leaderboard results yet.