SOTAVerified

Image to text

Papers

Showing 101–150 of 246 papers

Title | Status | Hype
Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution | | 0
ABC: Achieving Better Control of Multimodal Embeddings using VLMs | | 0
Accept the Modality Gap: An Exploration in the Hyperbolic Space | | 0
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training | | 0
AICoderEval: Improving AI Domain Code Generation of Large Language Models | | 0
AI Recommendation System for Enhanced Customer Experience: A Novel Image-to-Text Method | | 0
An End-to-End Neural Network for Image-to-Audio Transformation | | 0
An Online Learning Approach to Prompt-based Selection of Generative Models | | 0
Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models | | 0
A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering | | 0
Attention Guidance Mechanism for Handwritten Mathematical Expression Recognition | | 0
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models | | 0
Backdooring Vision-Language Models with Out-Of-Distribution Data | | 0
Better Text Understanding Through Image-To-Text Transfer | | 0
Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics | | 0
Beyond Images: An Integrative Multi-modal Approach to Chest X-Ray Report Generation | | 0
BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification | | 0
BIMCV-R: A Landmark Dataset for 3D CT Text-Image Retrieval | | 0
BRIT: Bidirectional Retrieval over Unified Image-Text Graph | | 0
Canonical Correlation Analysis for Misaligned Satellite Image Change Detection | | 0
CapText: Large Language Model-based Caption Generation From Image Context and Description | | 0
Captions Are Worth a Thousand Words: Enhancing Product Retrieval with Pretrained Image-to-Text Models | | 0
ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering | | 0
VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval | | 0
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? | | 0
CoBIT: A Contrastive Bi-directional Image-Text Generation Model | | 0
Contrastive Learning of Visual-Semantic Embeddings | | 0
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | | 0
Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval | | 0
Cross-Modal Alignment with Mixture Experts Neural Network for Intral-City Retail Recommendation | | 0
Cross-modal Contrastive Attention Model for Medical Report Generation | | 0
Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic | | 0
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation | | 0
Deductron -- A Recurrent Neural Network | | 0
Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese | | 0
DiffusionSTR: Diffusion Model for Scene Text Recognition | | 0
DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models | | 0
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | | 0
Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning | | 0
Doc2Im: document to image conversion through self-attentive embedding | | 0
DOCCI: Descriptions of Connected and Contrasting Images | | 0
Do DALL-E and Flamingo Understand Each Other? | | 0
Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection | | 0
Dynamic Traceback Learning for Medical Report Generation | | 0
Efficient End-to-End Visual Document Understanding with Rationale Distillation | | 0
EI-CLIP: Entity-Aware Interventional Contrastive Learning for E-Commerce Cross-Modal Retrieval | | 0
EmojiGAN: learning emojis distributions with a generative model | | 0
Enhancing Vision-Language Pre-training with Rich Supervisions | | 0
Evaluating authenticity and quality of image captions via sentiment and semantic analyses | | 0
Every picture tells a story: Image-grounded controllable stylistic story generation | | 0
Page 3 of 5

No leaderboard results yet.