SOTAVerified

Image to text

Papers

Showing 1–50 of 246 papers

| Title | Status | Hype |
| --- | --- | --- |
| Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Code | 6 |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | Code | 6 |
| Magma: A Foundation Model for Multimodal AI Agents | Code | 5 |
| FlowTok: Flowing Seamlessly Across Text and Image Tokens | Code | 5 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Code | 4 |
| Evaluating Text-to-Visual Generation with Image-to-Text Generation | Code | 3 |
| Emu: Generative Pretraining in Multimodality | Code | 3 |
| One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | Code | 3 |
| Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | Code | 2 |
| Planting a SEED of Vision in Large Language Model | Code | 2 |
| In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | Code | 2 |
| LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval | Code | 2 |
| CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching | Code | 2 |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | Code | 2 |
| Generative Diffusion Models on Graphs: Methods and Applications | Code | 2 |
| From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models | Code | 2 |
| Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities | Code | 2 |
| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | Code | 2 |
| Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval | Code | 2 |
| Libra: Building Decoupled Vision System on Large Language Models | Code | 2 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | Code | 2 |
| Bootstrapping Vision-Language Learning with Decoupled Language Pre-training | Code | 1 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Code | 1 |
| Write and Paint: Generative Vision-Language Models are Unified Modal Learners | Code | 1 |
| LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | Code | 1 |
| Improving Image Restoration through Removing Degradations in Textual Representations | Code | 1 |
| Concadia: Towards Image-Based Text Generation with a Purpose | Code | 1 |
| Language-Oriented Semantic Latent Representation for Image Transmission | Code | 1 |
| PRIOR: Prototype Representation Joint Learning from Medical Images and Reports | Code | 1 |
| Brain Captioning: Decoding human brain activity into images and text | Code | 1 |
| Multimodal Foundation Models For Echocardiogram Interpretation | Code | 1 |
| CMC-Bench: Towards a New Paradigm of Visual Signal Compression | Code | 1 |
| Multimodal Procedural Planning via Dual Text-Image Prompting | Code | 1 |
| MAGVLT: Masked Generative Vision-and-Language Transformer | Code | 1 |
| L-Verse: Bidirectional Generation Between Image and Text | Code | 1 |
| Vision-Language Dataset Distillation | Code | 1 |
| ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes | Code | 1 |
| FETA: Towards Specializing Foundation Models for Expert Task Applications | Code | 1 |
| Benchmarking Large Multimodal Models against Common Corruptions | Code | 1 |
| Linearly Mapping from Image to Text Space | Code | 1 |
| Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design | Code | 1 |
| FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training | Code | 1 |
| LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs | Code | 1 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Code | 1 |
| Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Code | 1 |
| DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles | Code | 1 |
| Can MLLMs Perform Text-to-Image In-Context Learning? | Code | 1 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Code | 1 |
| Beyond One-to-One: Rethinking the Referring Image Segmentation | Code | 1 |
| ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation | Code | 1 |
Page 1 of 5

No leaderboard results yet.