| Paper | Date | Tasks | Code | |
|---|---|---|---|---|
| TrojVLM: Backdoor Attack Against Vision Language Models | Sep 28, 2024 | Backdoor Attack, Image Captioning | Unverified | 0 |
| Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization | Sep 26, 2024 | Image to text, Image-to-Text Retrieval | Unverified | 0 |
| Evaluating authenticity and quality of image captions via sentiment and semantic analyses | Sep 14, 2024 | Image Captioning, Image to text | Unverified | 0 |
| See or Guess: Counterfactually Regularized Image Captioning | Aug 29, 2024 | Causal Inference, counterfactual | Code Available | 1 |
| UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation | Aug 21, 2024 | Image Generation, Image Retrieval | Code Available | 1 |
| Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models | Aug 16, 2024 | Image to text | Unverified | 0 |
| In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | Aug 9, 2024 | Image to text, Object | Code Available | 2 |
| Instruction Tuning-free Visual Token Complement for Multimodal LLMs | Aug 9, 2024 | Image Generation, Image to text | Unverified | 0 |
| GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models | Jul 30, 2024 | Image to text, Image-to-Text Retrieval | Code Available | 0 |
| Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities | Jul 29, 2024 | Contrastive Learning, DeepFake Detection | Code Available | 2 |
| Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic | Jul 25, 2024 | Image to text, Language Modeling | Unverified | 0 |
| GPC: Generative and General Pathology Image Classifier | Jul 12, 2024 | Classification, image-classification | Unverified | 0 |
| LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval | Jul 11, 2024 | Image Retrieval, Image to text | Code Available | 2 |
| 15M Multimodal Facial Image-Text Dataset | Jul 11, 2024 | Image to text | Unverified | 0 |
| Towards a text-based quantitative and explainable histopathology image analysis | Jul 10, 2024 | image-classification, Image Classification | Code Available | 0 |
| Vision-Braille: An End-to-End Tool for Chinese Braille Image-to-Text Translation | Jul 8, 2024 | Image to text, Lifelong learning | Unverified | 0 |
| HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels | Jul 8, 2024 | Contrastive Learning, Image Retrieval | Unverified | 0 |
| Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | Jul 1, 2024 | Image to text, Language Modeling | Unverified | 0 |
| A Data-Driven Guided Decoding Mechanism for Diagnostic Captioning | Jun 20, 2024 | Diagnostic, Image to text | Code Available | 0 |
| Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags | Jun 16, 2024 | Image to text, Instruction Following | Unverified | 0 |
| BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval | Jun 14, 2024 | Image Retrieval, Image to text | Code Available | 0 |
| CMC-Bench: Towards a New Paradigm of Visual Signal Compression | Jun 13, 2024 | Image Compression, Image to text | Code Available | 1 |
| Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval | Jun 11, 2024 | Image Retrieval, Image to text | Unverified | 0 |
| Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning | Jun 11, 2024 | Benchmarking, Contrastive Learning | Code Available | 0 |
| AICoderEval: Improving AI Domain Code Generation of Large Language Models | Jun 7, 2024 | Code Generation, Image to text | Unverified | 0 |