| Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval | Jun 11, 2024 | Image Retrieval, Image to text | —Unverified | 0 |
| Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning | Jun 11, 2024 | Benchmarking, Contrastive Learning | Code Available | 0 |
| AICoderEval: Improving AI Domain Code Generation of Large Language Models | Jun 7, 2024 | Code Generation, Image to text | —Unverified | 0 |
| Faithful Chart Summarization with ChaTS-Pi | May 29, 2024 | Image to text, Sentence | —Unverified | 0 |
| Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning | May 26, 2024 | Image to text, Image-to-Text Retrieval | —Unverified | 0 |
| Multi-modality Regional Alignment Network for Covid X-Ray Survival Prediction and Report Generation | May 23, 2024 | Image to text, Sentence | Code Available | 0 |
| DOCCI: Descriptions of Connected and Contrasting Images | Apr 30, 2024 | Image Generation, Image to text | —Unverified | 0 |
| Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation | Apr 30, 2024 | Caption Generation, Hallucination | —Unverified | 0 |
| Leveraging AI to Generate Audio for User-generated Content in Video Games | Apr 25, 2024 | Audio Generation, Game Design | —Unverified | 0 |
| VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations | Apr 25, 2024 | Image to text, Sensitivity | Code Available | 0 |
| Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection | Apr 15, 2024 | Anomaly Detection, Anomaly Localization | —Unverified | 0 |
| OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation | Apr 1, 2024 | Image Segmentation, Image to text | —Unverified | 0 |
| BIMCV-R: A Landmark Dataset for 3D CT Text-Image Retrieval | Mar 24, 2024 | Diagnostic, Image Retrieval | —Unverified | 0 |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | Mar 14, 2024 | Image to text, Optical Character Recognition (OCR) | —Unverified | 0 |
| CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? | Mar 7, 2024 | Image to text, Image-to-Text Retrieval | —Unverified | 0 |
| MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant | Mar 7, 2024 | Clinical Knowledge, Image to text | —Unverified | 0 |
| Enhancing Vision-Language Pre-training with Rich Supervisions | Mar 5, 2024 | Image to text, Table Detection | —Unverified | 0 |
| Attention Guidance Mechanism for Handwritten Mathematical Expression Recognition | Mar 4, 2024 | Image to text | —Unverified | 0 |
| Probing Multimodal Large Language Models for Global and Local Semantic Representations | Feb 27, 2024 | Image to text, object-detection | Code Available | 0 |
| A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models | Feb 21, 2024 | Benchmarking, Image to text | —Unverified | 0 |
| Captions Are Worth a Thousand Words: Enhancing Product Retrieval with Pretrained Image-to-Text Models | Feb 13, 2024 | Image Captioning, Image to text | —Unverified | 0 |
| Dynamic Traceback Learning for Medical Report Generation | Jan 24, 2024 | Image to text, Medical Report Generation | —Unverified | 0 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | Jan 5, 2024 | Image Comprehension, Image to text | —Unverified | 0 |
| SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment | Jan 4, 2024 | Image Captioning, image-classification | —Unverified | 0 |
| Accept the Modality Gap: An Exploration in the Hyperbolic Space | Jan 1, 2024 | Image to text, Image-to-Text Retrieval | —Unverified | 0 |