ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting | Feb 20, 2025 | Image Captioning, multimodal interaction | Unverified 0
What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness | Feb 19, 2025 | Image Captioning, Keyword Extraction | Unverified 0
InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models | Feb 19, 2025 | Image Captioning | Unverified 0
A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models | Feb 19, 2025 | Image Captioning, Language Modeling | Unverified 0
Pretrained Image-Text Models are Secretly Video Captioners | Feb 19, 2025 | Image Captioning, Video Captioning | Code Available 0
GroundCap: A Visually Grounded Image Captioning Dataset | Feb 19, 2025 | Image Captioning, Object Detection | Unverified 0
TPCap: Unlocking Zero-Shot Image Captioning with Trigger-Augmented and Multi-Modal Purification Modules | Feb 16, 2025 | GPU, Image Captioning | Unverified 0
VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | Feb 14, 2025 | Image Captioning, Large Language Model | Unverified 0
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis | Feb 13, 2025 | Cross-Modal Retrieval, Image Captioning | Code Available 1
FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning | Feb 13, 2025 | Caption Generation, Decoder | Unverified 0
Vision-Language Models for Edge Networks: A Comprehensive Survey | Feb 11, 2025 | Autonomous Vehicles, Image Captioning | Unverified 0
Evaluation of Multilingual Image Captioning: How far can we get with CLIP models? | Feb 10, 2025 | Image Captioning, Semantic correspondence | Code Available 0
Generative Distribution Prediction: A Unified Approach to Multimodal Learning | Feb 10, 2025 | Domain Adaptation, Image Captioning | Unverified 0
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Feb 9, 2025 | Image Captioning, Image-text Retrieval | Code Available 3
Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents | Feb 6, 2025 | Image Captioning, Optical Character Recognition | Unverified 0
Efficient Few-Shot Continual Learning in Vision-Language Models | Feb 6, 2025 | Continual Learning, Image Captioning | Unverified 0
TexLiDAR: Automated Text Understanding for Panoramic LiDAR Data | Feb 5, 2025 | Image Captioning, object-detection | Code Available 0
Exploring Spatial Language Grounding Through Referring Expressions | Feb 4, 2025 | Image Captioning, Negation | Unverified 0
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation | Feb 4, 2025 | Image Captioning, Panoptic Segmentation | Unverified 0
Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models | Feb 3, 2025 | Adversarial Robustness, Image Captioning | Code Available 1
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | Jan 30, 2025 | Benchmarking, Decision Making | Unverified 0
Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes | Jan 23, 2025 | Emotion Classification, Image Captioning | Code Available 0
An Ensemble Model with Attention Based Mechanism for Image Captioning | Jan 22, 2025 | Ensemble Learning, Image Captioning | Unverified 0
PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model | Jan 21, 2025 | Hallucination, Image Captioning | Code Available 1
Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis | Jan 16, 2025 | Decoder, Image Captioning | Code Available 0
LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport | Jan 16, 2025 | AudioCaps, Audio captioning | Code Available 1
Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness | Jan 16, 2025 | Adversarial Defense, Adversarial Robustness | Unverified 0
VCRScore: Image captioning metric based on V&L Transformers, CLIP, and precision-recall | Jan 15, 2025 | Image Captioning | Unverified 0
RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment | Jan 13, 2025 | Concept Alignment, Image Captioning | Code Available 1
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | Jan 12, 2025 | Image Captioning, Language Modeling | Unverified 0
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Jan 10, 2025 | Image Captioning, Language Modeling | Code Available 3
Improving Image Captioning by Mimicking Human Reformulation Feedback at Inference-time | Jan 8, 2025 | Image Captioning, Style Transfer | Unverified 0
Evaluating Image Caption via Cycle-consistent Text-to-Image Generation | Jan 7, 2025 | Contrastive Learning, Diversity | Unverified 0
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image Captioning, Image to text | Code Available 1
Decoding fMRI Data into Captions using Prefix Language Modeling | Jan 5, 2025 | Brain Decoding, Image Captioning | Code Available 0
MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning | Jan 3, 2025 | Diagnostic, General Knowledge | Unverified 0
Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception | Jan 1, 2025 | Image Captioning, Image Generation | Unverified 0
AdaDARE-gamma: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation | Jan 1, 2025 | Image Captioning, Question Answering | Unverified 0
Diffusion Bridge: Leveraging Diffusion Model to Reduce the Modality Gap Between Text and Vision for Zero-Shot Image Captioning | Jan 1, 2025 | cross-modal alignment, Denoising | Code Available 1
Semantic and Expressive Variations in Image Captions Across Languages | Jan 1, 2025 | Descriptive, Image Captioning | Unverified 0
Variance-Based Membership Inference Attacks Against Large-Scale Image Captioning Models | Jan 1, 2025 | Image Captioning, Memorization | Unverified 0
Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution | Jan 1, 2025 | Depth Estimation, Image Captioning | Unverified 0
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning | Dec 31, 2024 | Caption Generation, Decoder | Unverified 0
Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering | Dec 30, 2024 | Image Captioning, Object Recognition | Unverified 0
ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers | Dec 27, 2024 | Image Captioning, Question Answering | Unverified 0
ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning | Dec 26, 2024 | Image Captioning, Retrieval | Code Available 0
EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation | Dec 24, 2024 | Image Captioning, Image Generation | Code Available 2
Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy | Dec 23, 2024 | Image Captioning, Question Answering | Unverified 0
GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning | Dec 23, 2024 | Image Captioning, Language Modeling | Unverified 0
SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization | Dec 21, 2024 | Image Captioning, Multimodal Reasoning | Code Available 0