| Libra: Building Decoupled Vision System on Large Language Models | May 16, 2024 | Image to textLanguage Modeling | CodeCode Available | 2 |
| DOCCI: Descriptions of Connected and Contrasting Images | Apr 30, 2024 | Image GenerationImage to text | —Unverified | 0 |
| Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation | Apr 30, 2024 | Caption GenerationHallucination | —Unverified | 0 |
| Leveraging AI to Generate Audio for User-generated Content in Video Games | Apr 25, 2024 | Audio GenerationGame Design | —Unverified | 0 |
| VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations | Apr 25, 2024 | Image to textSensitivity | CodeCode Available | 0 |
| LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | Apr 16, 2024 | Image CaptioningImage Generation | CodeCode Available | 1 |
| Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection | Apr 15, 2024 | Anomaly DetectionAnomaly Localization | —Unverified | 0 |
| CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching | Apr 4, 2024 | AttributeImage Captioning | CodeCode Available | 2 |
| OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation | Apr 1, 2024 | Image SegmentationImage to text | —Unverified | 0 |
| Evaluating Text-to-Visual Generation with Image-to-Text Generation | Apr 1, 2024 | Image to textQuestion Answering | CodeCode Available | 3 |