SOTAVerified

Image to text

Papers

Showing 151–200 of 246 papers

| Title | Status | Hype |
| --- | --- | --- |
| DiffusionSTR: Diffusion Model for Scene Text Recognition | | 0 |
| I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models | | 0 |
| CapText: Large Language Model-based Caption Generation From Image Context and Description | | 0 |
| Brain Captioning: Decoding human brain activity into images and text | Code | 1 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | Code | 1 |
| Category-Oriented Representation Learning for Image to Multi-Modal Retrieval | | 0 |
| Image Captioners Sometimes Tell More Than Images They See | | 0 |
| Multimodal Procedural Planning via Dual Text-Image Prompting | Code | 1 |
| Interpreting Vision and Language Generative Models with Semantic Visual Priors | | 0 |
| RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models | Code | 0 |
| Is Cross-modal Information Retrieval Possible without Training? | | 0 |
| Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models | | 0 |
| CoBIT: A Contrastive Bi-directional Image-Text Generation Model | | 0 |
| MAGVLT: Masked Generative Vision-and-Language Transformer | Code | 1 |
| Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling | | 0 |
| One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | Code | 3 |
| ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation | Code | 1 |
| An End-to-End Neural Network for Image-to-Audio Transformation | | 0 |
| Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts | Code | 1 |
| VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval | | 0 |
| Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning | | 0 |
| Generative Diffusion Models on Graphs: Methods and Applications | Code | 2 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Code | 1 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Code | 4 |
| Adaptively Clustering Neighbor Elements for Image-Text Generation | Code | 0 |
| SLAN: Self-Locator Aided Network for Vision-Language Understanding | | 0 |
| Do DALL-E and Flamingo Understand Each Other? | | 0 |
| When are Lemons Purple? The Concept Association Bias of Vision-Language Models | | 0 |
| MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | Code | 0 |
| SLAN: Self-Locator Aided Network for Cross-Modal Understanding | | 0 |
| Retrieval-Augmented Multimodal Language Modeling | | 0 |
| Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Code | 6 |
| Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | Code | 1 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | | 0 |
| Improving the Factual Correctness of Radiology Report Generation with Semantic Rewards | Code | 0 |
| Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation | Code | 1 |
| Image Semantic Relation Generation | | 0 |
| Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | Code | 2 |
| Cross-modal Contrastive Attention Model for Medical Report Generation | | 0 |
| Linearly Mapping from Image to Text Space | Code | 1 |
| FETA: Towards Specializing Foundation Models for Expert Task Applications | Code | 1 |
| Every picture tells a story: Image-grounded controllable stylistic story generation | | 0 |
| Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning | | 0 |
| Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval | | 0 |
| SRCB at SemEval-2022 Task 5: Pretraining Based Image to Text Late Sequential Fusion System for Multimodal Misogynous Meme Identification | | 0 |
| What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs | Code | 1 |
| Write and Paint: Generative Vision-Language Models are Unified Modal Learners | Code | 1 |
| Delving into the Openness of CLIP | Code | 0 |
| Multilingual Image Corpus – Towards a Multimodal and Multilingual Dataset | | 0 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | Code | 2 |

No leaderboard results yet.