| Magma: A Foundation Model for Multimodal AI Agents | Feb 18, 2025 | Autonomous Web NavigationImage to text | CodeCode Available | 5 |
| UNITE-FND: Reframing Multimodal Fake News Detection through Unimodal Scene Translation | Feb 16, 2025 | Binary ClassificationFake News Detection | —Unverified | 0 |
| UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding | Feb 8, 2025 | DenoisingImage Generation | CodeCode Available | 1 |
| Multi-LLM Collaborative Caption Generation in Scientific Documents | Jan 5, 2025 | Caption GenerationImage to text | CodeCode Available | 0 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image CaptioningImage to text | CodeCode Available | 1 |
| Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training | Jan 1, 2025 | Image-text RetrievalImage to text | —Unverified | 0 |
| Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | Jan 1, 2025 | image-classificationImage Classification | —Unverified | 0 |
| PromptHash:Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval | Jan 1, 2025 | Contrastive LearningImage Retrieval | CodeCode Available | 0 |
| Survey on Abstractive Text Summarization: Dataset, Models, and Metrics | Dec 22, 2024 | Abstractive Text SummarizationGeneral Knowledge | CodeCode Available | 0 |
| CLIP-FSAC++: Few-Shot Anomaly Classification with Anomaly Descriptor Based on CLIP | Dec 5, 2024 | Anomaly ClassificationAnomaly Detection | CodeCode Available | 0 |
| DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | Dec 2, 2024 | Caption GenerationDomain Generalization | —Unverified | 0 |
| Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation | Nov 23, 2024 | Cross-Modal RetrievalImage to text | —Unverified | 0 |
| FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training | Nov 18, 2024 | Data AugmentationImage to text | CodeCode Available | 1 |
| Everything is a Video: Unifying Modalities through Next-Frame Prediction | Nov 15, 2024 | Caption GenerationCross-Modal Retrieval | —Unverified | 0 |
| Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models | Nov 8, 2024 | Image CaptioningImage Generation | —Unverified | 0 |
| From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing | Nov 5, 2024 | Change DetectionContrastive Learning | —Unverified | 0 |
| Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization | Oct 30, 2024 | Image to textImage-to-Text Retrieval | —Unverified | 0 |
| Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval | Oct 28, 2024 | Image RetrievalImage to text | CodeCode Available | 2 |
| Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs) | Oct 25, 2024 | AttributeImage to text | CodeCode Available | 0 |
| Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics | Oct 24, 2024 | Image to textImage-Variation | —Unverified | 0 |
| Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image | Oct 20, 2024 | Image to text | —Unverified | 0 |
| An Online Learning Approach to Prompt-based Selection of Generative Models | Oct 17, 2024 | Image to text | —Unverified | 0 |
| Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models | Oct 7, 2024 | Image to text | —Unverified | 0 |
| Backdooring Vision-Language Models with Out-Of-Distribution Data | Oct 2, 2024 | Image CaptioningImage to text | —Unverified | 0 |
| See then Tell: Enhancing Key Information Extraction with Vision Grounding | Sep 29, 2024 | Image to textKey Information Extraction | —Unverified | 0 |