Image to text

Papers

Showing 1–50 of 246 papers

Title	Date	Tasks	Status	Hype
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages	Aug 23, 2023	Image Generation, Image to text	Code Available	6
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model	Nov 15, 2022	All, Disentanglement	Code Available	6
FlowTok: Flowing Seamlessly Across Text and Image Tokens	Mar 13, 2025	Denoising, Image to text	Code Available	5
Magma: A Foundation Model for Multimodal AI Agents	Feb 18, 2025	Autonomous Web Navigation, Image to text	Code Available	5
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	Jan 30, 2023	Generative Visual Question Answering, Image Captioning	Code Available	4
Evaluating Text-to-Visual Generation with Image-to-Text Generation	Apr 1, 2024	Image to text, Question Answering	Code Available	3
Emu: Generative Pretraining in Multimodality	Jul 11, 2023	Image Captioning, Image Generation	Code Available	3
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale	Mar 12, 2023	All, Image Generation	Code Available	3
Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval	Oct 28, 2024	Image Retrieval, Image to text	Code Available	2
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation	Aug 9, 2024	Image to text, Object	Code Available	2
Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities	Jul 29, 2024	Contrastive Learning, DeepFake Detection	Code Available	2
LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval	Jul 11, 2024	Image Retrieval, Image to text	Code Available	2
Libra: Building Decoupled Vision System on Large Language Models	May 16, 2024	Image to text, Language Modeling	Code Available	2
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching	Apr 4, 2024	Attribute, Image Captioning	Code Available	2
From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models	Apr 1, 2024	Graph Generation, Image to text	Code Available	2
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering	Sep 29, 2023	Image to text, Passage Retrieval	Code Available	2
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning	Sep 5, 2023	Decoder, Image Generation	Code Available	2
Planting a SEED of Vision in Large Language Model	Jul 16, 2023	Image Generation, Image to text	Code Available	2
Generative Diffusion Models on Graphs: Methods and Applications	Feb 6, 2023	Denoising, Graph Generation	Code Available	2
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding	Oct 7, 2022	Chart Question Answering, Diversity	Code Available	2
GIT: A Generative Image-to-text Transformer for Vision and Language	May 27, 2022	Decoder, Image Captioning	Code Available	2
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models	Jun 10, 2025	Contrastive Learning, Image-text matching	Code Available	1
LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs	Apr 11, 2025	Benchmarking, Image Generation	Code Available	1
LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text	Mar 25, 2025	Cross-Modal Retrieval, Hallucination	Code Available	1
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles	Mar 5, 2025	Domain Adaptation, Image to text	Code Available	1
UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding	Feb 8, 2025	Denoising, Image Generation	Code Available	1
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?	Jan 5, 2025	Image Captioning, Image to text	Code Available	1
FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training	Nov 18, 2024	Data Augmentation, Image to text	Code Available	1
See or Guess: Counterfactually Regularized Image Captioning	Aug 29, 2024	Causal Inference, counterfactual	Code Available	1
UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation	Aug 21, 2024	Image Generation, Image Retrieval	Code Available	1
CMC-Bench: Towards a New Paradigm of Visual Signal Compression	Jun 13, 2024	Image Compression, Image to text	Code Available	1
Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design	May 29, 2024	Dataset Generation, Image to text	Code Available	1
Language-Oriented Semantic Latent Representation for Image Transmission	May 16, 2024	Image to text, Semantic Communication	Code Available	1
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?	Apr 16, 2024	Image Captioning, Image Generation	Code Available	1
ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes	Mar 7, 2024	Image to text, Object	Code Available	1
Can MLLMs Perform Text-to-Image In-Context Learning?	Feb 2, 2024	Image Generation, Image to text	Code Available	1
Benchmarking Large Multimodal Models against Common Corruptions	Jan 22, 2024	Benchmarking, Image to text	Code Available	1
Improving Image Restoration through Removing Degradations in Textual Representations	Dec 28, 2023	Deblurring, Denoising	Code Available	1
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models	Nov 27, 2023	Cross-Modal Retrieval, Image Generation	Code Available	1
UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web	Oct 22, 2023	Image to text, Language Modeling	Code Available	1
Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition	Oct 8, 2023	Image to text, Optical Character Recognition (OCR)	Code Available	1
Multimodal Foundation Models For Echocardiogram Interpretation	Aug 29, 2023	Cross-Modal Retrieval, Diagnostic	Code Available	1
Beyond One-to-One: Rethinking the Referring Image Segmentation	Aug 26, 2023	Decoder, Image Segmentation	Code Available	1
Vision-Language Dataset Distillation	Aug 15, 2023	Dataset Distillation, image-classification	Code Available	1
Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval	Aug 8, 2023	Cross-Modal Retrieval, Image Retrieval	Code Available	1
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning	Jul 31, 2023	Caption Generation, Hallucination	Code Available	1
PRIOR: Prototype Representation Joint Learning from Medical Images and Reports	Jul 24, 2023	Contrastive Learning, Image to text	Code Available	1
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training	Jul 13, 2023	Image to text	Code Available	1
Brain Captioning: Decoding human brain activity into images and text	May 19, 2023	Brain Decoding, Depth Estimation	Code Available	1
What You See is What You Read? Improving Text-Image Alignment Evaluation	May 17, 2023	Image Generation, Image to text	Code Available	1

No leaderboard results yet.