| RefineNet: Enhancing Text-to-Image Conversion with High-Resolution and Detail Accuracy through Hierarchical Transformers and Progressive Refinement | Dec 27, 2023 | Computational EfficiencyImage Generation | —Unverified | 0 |
| DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models | Dec 12, 2023 | DenoisingDiversity | —Unverified | 0 |
| Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval | Dec 4, 2023 | AttributeCross-Modal Person Re-Identification | —Unverified | 0 |
| Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection | Dec 4, 2023 | Image to textobject-detection | —Unverified | 0 |
| Pragmatic Radiology Report Generation | Nov 28, 2023 | Image to text | CodeCode Available | 0 |
| Beyond Images: An Integrative Multi-modal Approach to Chest X-Ray Report Generation | Nov 18, 2023 | Image to textSemantic Similarity | —Unverified | 0 |
| AI Recommendation System for Enhanced Customer Experience: A Novel Image-to-Text Method | Nov 16, 2023 | Image to textObject | —Unverified | 0 |
| Efficient End-to-End Visual Document Understanding with Rationale Distillation | Nov 16, 2023 | document understandingImage to text | —Unverified | 0 |
| Semantically Grounded QFormer for Efficient Vision Language Understanding | Nov 13, 2023 | DiversityImage to text | —Unverified | 0 |
| GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks | Nov 2, 2023 | Image GenerationImage to text | —Unverified | 0 |
| Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning | Oct 12, 2023 | Image CaptioningImage-text Retrieval | —Unverified | 0 |
| SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing | Oct 12, 2023 | Image GenerationImage to text | —Unverified | 0 |
| Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API | Oct 7, 2023 | Decoderdocument understanding | —Unverified | 0 |
| Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency | Oct 5, 2023 | Image GenerationImage to text | —Unverified | 0 |
| Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search | Sep 28, 2023 | cross-modal alignmentCross-Modal Retrieval | CodeCode Available | 0 |
| SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution | Sep 25, 2023 | Image to text | —Unverified | 0 |
| CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval | Sep 18, 2023 | Image to textPerson Retrieval | CodeCode Available | 0 |
| Offline Detection of Misspelled Handwritten Words by Convolving Recognition Model Features with Text Labels | Sep 18, 2023 | Generative Adversarial NetworkHandwriting Recognition | —Unverified | 0 |
| BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification | Sep 9, 2023 | Image to textLanguage Modeling | —Unverified | 0 |
| Sequential Semantic Generative Communication for Progressive Text-to-Image Generation | Sep 8, 2023 | Image GenerationImage to text | —Unverified | 0 |
| GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training | Aug 22, 2023 | image-classificationImage Classification | —Unverified | 0 |
| Multimodal Neurons in Pretrained Text-Only Transformers | Aug 3, 2023 | Image CaptioningImage to text | —Unverified | 0 |
| Revisiting DETR Pre-training for Object Detection | Aug 2, 2023 | Image to textObject | —Unverified | 0 |
| Towards a Visual-Language Foundation Model for Computational Pathology | Jul 24, 2023 | Contrastive Learningimage-classification | —Unverified | 0 |
| PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting | Jul 14, 2023 | Cross-Modal RetrievalImage to text | —Unverified | 0 |
| MultiQG-TI: Towards Question Generation from Multi-modal Sources | Jul 7, 2023 | Image to textOptical Character Recognition | CodeCode Available | 0 |
| Zero-shot Nuclei Detection via Visual-Language Pre-trained Models | Jun 30, 2023 | Image to textobject-detection | CodeCode Available | 0 |
| DiffusionSTR: Diffusion Model for Scene Text Recognition | Jun 29, 2023 | Image to textmodel | —Unverified | 0 |
| I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models | Jun 13, 2023 | Adversarial AttackDecoder | —Unverified | 0 |
| CapText: Large Language Model-based Caption Generation From Image Context and Description | Jun 1, 2023 | Caption GenerationImage to text | —Unverified | 0 |
| Category-Oriented Representation Learning for Image to Multi-Modal Retrieval | May 6, 2023 | Cross-Modal RetrievalImage Retrieval | —Unverified | 0 |
| Image Captioners Sometimes Tell More Than Images They See | May 4, 2023 | DescriptiveImage Captioning | —Unverified | 0 |
| Interpreting Vision and Language Generative Models with Semantic Visual Priors | Apr 28, 2023 | Image to text | —Unverified | 0 |
| RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models | Apr 21, 2023 | Cross-Modal RetrievalImage-text matching | CodeCode Available | 0 |
| Is Cross-modal Information Retrieval Possible without Training? | Apr 20, 2023 | Contrastive LearningCross-Modal Information Retrieval | —Unverified | 0 |
| Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models | Mar 30, 2023 | Image to textPrompt Learning | —Unverified | 0 |
| CoBIT: A Contrastive Bi-directional Image-Text Generation Model | Mar 23, 2023 | DecoderImage Generation | —Unverified | 0 |
| Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling | Mar 13, 2023 | DecoderImage to text | —Unverified | 0 |
| An End-to-End Neural Network for Image-to-Audio Transformation | Mar 10, 2023 | Image to texttext-to-speech | —Unverified | 0 |
| VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval | Feb 13, 2023 | Cross-Modal Information RetrievalCross-Modal Retrieval | —Unverified | 0 |
| Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning | Feb 9, 2023 | Few-Shot LearningImage Captioning | —Unverified | 0 |
| Adaptively Clustering Neighbor Elements for Image-Text Generation | Jan 5, 2023 | ClusteringDecoder | CodeCode Available | 0 |
| SLAN: Self-Locator Aided Network for Vision-Language Understanding | Jan 1, 2023 | Image RetrievalImage to text | —Unverified | 0 |
| Do DALL-E and Flamingo Understand Each Other? | Dec 23, 2022 | Image CaptioningImage Generation | —Unverified | 0 |
| When are Lemons Purple? The Concept Association Bias of Vision-Language Models | Dec 22, 2022 | Attributeimage-classification | —Unverified | 0 |
| MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | Dec 19, 2022 | Chart Question AnsweringData Summarization | CodeCode Available | 0 |
| SLAN: Self-Locator Aided Network for Cross-Modal Understanding | Nov 28, 2022 | Image RetrievalImage to text | —Unverified | 0 |
| Retrieval-Augmented Multimodal Language Modeling | Nov 22, 2022 | Caption GenerationImage Captioning | —Unverified | 0 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | Oct 24, 2022 | cross-modal alignmentCross-Modal Retrieval | —Unverified | 0 |
| Improving the Factual Correctness of Radiology Report Generation with Semantic Rewards | Oct 21, 2022 | Image to textnamed-entity-recognition | CodeCode Available | 0 |