| DiffusionSTR: Diffusion Model for Scene Text Recognition | Jun 29, 2023 | Image to text, model | Unverified | 0 |
| I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models | Jun 13, 2023 | Adversarial Attack, Decoder | Unverified | 0 |
| CapText: Large Language Model-based Caption Generation From Image Context and Description | Jun 1, 2023 | Caption Generation, Image to text | Unverified | 0 |
| Brain Captioning: Decoding human brain activity into images and text | May 19, 2023 | Brain Decoding, Depth Estimation | Code Available | 1 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | May 17, 2023 | Image Generation, Image to text | Code Available | 1 |
| Category-Oriented Representation Learning for Image to Multi-Modal Retrieval | May 6, 2023 | Cross-Modal Retrieval, Image Retrieval | Unverified | 0 |
| Image Captioners Sometimes Tell More Than Images They See | May 4, 2023 | Descriptive, Image Captioning | Unverified | 0 |
| Multimodal Procedural Planning via Dual Text-Image Prompting | May 2, 2023 | Image Generation, Image to text | Code Available | 1 |
| Interpreting Vision and Language Generative Models with Semantic Visual Priors | Apr 28, 2023 | Image to text | Unverified | 0 |
| RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models | Apr 21, 2023 | Cross-Modal Retrieval, Image-text matching | Code Available | 0 |
| Is Cross-modal Information Retrieval Possible without Training? | Apr 20, 2023 | Contrastive Learning, Cross-Modal Information Retrieval | Unverified | 0 |
| Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models | Mar 30, 2023 | Image to text, Prompt Learning | Unverified | 0 |
| CoBIT: A Contrastive Bi-directional Image-Text Generation Model | Mar 23, 2023 | Decoder, Image Generation | Unverified | 0 |
| MAGVLT: Masked Generative Vision-and-Language Transformer | Mar 21, 2023 | Image Captioning, Image Generation | Code Available | 1 |
| Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling | Mar 13, 2023 | Decoder, Image to text | Unverified | 0 |
| One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | Mar 12, 2023 | All, Image Generation | Code Available | 3 |
| ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation | Mar 11, 2023 | Image Captioning, Image to text | Code Available | 1 |
| An End-to-End Neural Network for Image-to-Audio Transformation | Mar 10, 2023 | Image to text, text-to-speech | Unverified | 0 |
| Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts | Feb 17, 2023 | Image Retrieval, Image-text Classification | Code Available | 1 |
| VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval | Feb 13, 2023 | Cross-Modal Information Retrieval, Cross-Modal Retrieval | Unverified | 0 |
| Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning | Feb 9, 2023 | Few-Shot Learning, Image Captioning | Unverified | 0 |
| Generative Diffusion Models on Graphs: Methods and Applications | Feb 6, 2023 | Denoising, Graph Generation | Code Available | 2 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Feb 2, 2023 | Attribute, Few-Shot Image Classification | Code Available | 1 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering, Image Captioning | Code Available | 4 |
| Adaptively Clustering Neighbor Elements for Image-Text Generation | Jan 5, 2023 | Clustering, Decoder | Code Available | 0 |
| SLAN: Self-Locator Aided Network for Vision-Language Understanding | Jan 1, 2023 | Image Retrieval, Image to text | Unverified | 0 |
| Do DALL-E and Flamingo Understand Each Other? | Dec 23, 2022 | Image Captioning, Image Generation | Unverified | 0 |
| When are Lemons Purple? The Concept Association Bias of Vision-Language Models | Dec 22, 2022 | Attribute, image-classification | Unverified | 0 |
| MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | Dec 19, 2022 | Chart Question Answering, Data Summarization | Code Available | 0 |
| SLAN: Self-Locator Aided Network for Cross-Modal Understanding | Nov 28, 2022 | Image Retrieval, Image to text | Unverified | 0 |
| Retrieval-Augmented Multimodal Language Modeling | Nov 22, 2022 | Caption Generation, Image Captioning | Unverified | 0 |
| Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Nov 15, 2022 | All, Disentanglement | Code Available | 6 |
| Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | Nov 9, 2022 | Image Generation, Image to text | Code Available | 1 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | Oct 24, 2022 | cross-modal alignment, Cross-Modal Retrieval | Unverified | 0 |
| Improving the Factual Correctness of Radiology Report Generation with Semantic Rewards | Oct 21, 2022 | Image to text, named-entity-recognition | Code Available | 0 |
| Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation | Oct 20, 2022 | Decoder, Image Captioning | Code Available | 1 |
| Image Semantic Relation Generation | Oct 19, 2022 | Image Retrieval, Image Segmentation | Unverified | 0 |
| Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | Oct 7, 2022 | Chart Question Answering, Diversity | Code Available | 2 |
| Cross-modal Contrastive Attention Model for Medical Report Generation | Oct 1, 2022 | Image to text, Medical Report Generation | Unverified | 0 |
| Linearly Mapping from Image to Text Space | Sep 30, 2022 | Image Captioning, Image to text | Code Available | 1 |
| FETA: Towards Specializing Foundation Models for Expert Task Applications | Sep 8, 2022 | Domain Generalization, Few-Shot Learning | Code Available | 1 |
| Every picture tells a story: Image-grounded controllable stylistic story generation | Sep 4, 2022 | Image Captioning, Image to text | Unverified | 0 |
| Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning | Aug 18, 2022 | Image Generation, Image to text | Unverified | 0 |
| Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval | Jul 29, 2022 | Cross-Modal Retrieval, Data Augmentation | Unverified | 0 |
| SRCB at SemEval-2022 Task 5: Pretraining Based Image to Text Late Sequential Fusion System for Multimodal Misogynous Meme Identification | Jul 1, 2022 | Image to text | Unverified | 0 |
| What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs | Jun 19, 2022 | Benchmarking, Image Captioning | Code Available | 1 |
| Write and Paint: Generative Vision-Language Models are Unified Modal Learners | Jun 15, 2022 | Image Generation, Image to text | Code Available | 1 |
| Delving into the Openness of CLIP | Jun 4, 2022 | image-classification, Image Classification | Code Available | 0 |
| Multilingual Image Corpus – Towards a Multimodal and Multilingual Dataset | Jun 1, 2022 | Caption Generation, image-classification | Unverified | 0 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | May 27, 2022 | Decoder, Image Captioning | Code Available | 2 |