| TrojVLM: Backdoor Attack Against Vision Language Models | Sep 28, 2024 | Backdoor Attack, Image Captioning | —Unverified | 0 |
| Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization | Sep 26, 2024 | Image to text, Image-to-Text Retrieval | —Unverified | 0 |
| Evaluating authenticity and quality of image captions via sentiment and semantic analyses | Sep 14, 2024 | Image Captioning, Image to text | —Unverified | 0 |
| See or Guess: Counterfactually Regularized Image Captioning | Aug 29, 2024 | Causal Inference, counterfactual | Code Available | 1 |
| UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation | Aug 21, 2024 | Image Generation, Image Retrieval | Code Available | 1 |
| Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models | Aug 16, 2024 | Image to text | —Unverified | 0 |
| Instruction Tuning-free Visual Token Complement for Multimodal LLMs | Aug 9, 2024 | Image Generation, Image to text | —Unverified | 0 |
| In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | Aug 9, 2024 | Image to text, Object | Code Available | 2 |
| GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models | Jul 30, 2024 | Image to text, Image-to-Text Retrieval | Code Available | 0 |
| Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities | Jul 29, 2024 | Contrastive Learning, DeepFake Detection | Code Available | 2 |
| Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic | Jul 25, 2024 | Image to text, Language Modeling | —Unverified | 0 |
| GPC: Generative and General Pathology Image Classifier | Jul 12, 2024 | Classification, image-classification | —Unverified | 0 |
| LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval | Jul 11, 2024 | Image Retrieval, Image to text | Code Available | 2 |
| 15M Multimodal Facial Image-Text Dataset | Jul 11, 2024 | Image to text | —Unverified | 0 |
| Towards a text-based quantitative and explainable histopathology image analysis | Jul 10, 2024 | image-classification, Image Classification | Code Available | 0 |
| HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels | Jul 8, 2024 | Contrastive Learning, Image Retrieval | —Unverified | 0 |
| Vision-Braille: An End-to-End Tool for Chinese Braille Image-to-Text Translation | Jul 8, 2024 | Image to text, Lifelong learning | —Unverified | 0 |
| Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | Jul 1, 2024 | Image to text, Language Modeling | —Unverified | 0 |
| A Data-Driven Guided Decoding Mechanism for Diagnostic Captioning | Jun 20, 2024 | Diagnostic, Image to text | Code Available | 0 |
| Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags | Jun 16, 2024 | Image to text, Instruction Following | —Unverified | 0 |
| BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval | Jun 14, 2024 | Image Retrieval, Image to text | Code Available | 0 |
| CMC-Bench: Towards a New Paradigm of Visual Signal Compression | Jun 13, 2024 | Image Compression, Image to text | Code Available | 1 |
| Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval | Jun 11, 2024 | Image Retrieval, Image to text | —Unverified | 0 |
| Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning | Jun 11, 2024 | Benchmarking, Contrastive Learning | Code Available | 0 |
| AICoderEval: Improving AI Domain Code Generation of Large Language Models | Jun 7, 2024 | Code Generation, Image to text | —Unverified | 0 |
| Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design | May 29, 2024 | Dataset Generation, Image to text | Code Available | 1 |
| Faithful Chart Summarization with ChaTS-Pi | May 29, 2024 | Image to text, Sentence | —Unverified | 0 |
| Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning | May 26, 2024 | Image to text, Image-to-Text Retrieval | —Unverified | 0 |
| Multi-modality Regional Alignment Network for Covid X-Ray Survival Prediction and Report Generation | May 23, 2024 | Image to text, Sentence | Code Available | 0 |
| Libra: Building Decoupled Vision System on Large Language Models | May 16, 2024 | Image to text, Language Modeling | Code Available | 2 |
| Language-Oriented Semantic Latent Representation for Image Transmission | May 16, 2024 | Image to text, Semantic Communication | Code Available | 1 |
| DOCCI: Descriptions of Connected and Contrasting Images | Apr 30, 2024 | Image Generation, Image to text | —Unverified | 0 |
| Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation | Apr 30, 2024 | Caption Generation, Hallucination | —Unverified | 0 |
| Leveraging AI to Generate Audio for User-generated Content in Video Games | Apr 25, 2024 | Audio Generation, Game Design | —Unverified | 0 |
| VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations | Apr 25, 2024 | Image to text, Sensitivity | Code Available | 0 |
| LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | Apr 16, 2024 | Image Captioning, Image Generation | Code Available | 1 |
| Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection | Apr 15, 2024 | Anomaly Detection, Anomaly Localization | —Unverified | 0 |
| CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching | Apr 4, 2024 | Attribute, Image Captioning | Code Available | 2 |
| OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation | Apr 1, 2024 | Image Segmentation, Image to text | —Unverified | 0 |
| From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models | Apr 1, 2024 | Graph Generation, Image to text | Code Available | 2 |
| Evaluating Text-to-Visual Generation with Image-to-Text Generation | Apr 1, 2024 | Image to text, Question Answering | Code Available | 3 |
| BIMCV-R: A Landmark Dataset for 3D CT Text-Image Retrieval | Mar 24, 2024 | Diagnostic, Image Retrieval | —Unverified | 0 |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | Mar 14, 2024 | Image to text, Optical Character Recognition (OCR) | —Unverified | 0 |
| ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes | Mar 7, 2024 | Image to text, Object | Code Available | 1 |
| MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant | Mar 7, 2024 | Clinical Knowledge, Image to text | —Unverified | 0 |
| CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? | Mar 7, 2024 | Image to text, Image-to-Text Retrieval | —Unverified | 0 |
| Enhancing Vision-Language Pre-training with Rich Supervisions | Mar 5, 2024 | Image to text, Table Detection | —Unverified | 0 |
| Attention Guidance Mechanism for Handwritten Mathematical Expression Recognition | Mar 4, 2024 | Image to text | —Unverified | 0 |
| Probing Multimodal Large Language Models for Global and Local Semantic Representations | Feb 27, 2024 | Image to text, object-detection | Code Available | 0 |
| A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models | Feb 21, 2024 | Benchmarking, Image to text | —Unverified | 0 |