Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos | Jul 16, 2025 | Image Captioning, Representation Learning | Unverified | 0
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval | Jun 28, 2025 | Cross-Modal Retrieval, Image Captioning | Unverified | 0
HalLoc: Token-level Localization of Hallucinations for Vision Language Models | Jun 12, 2025 | Hallucination, Image Captioning | Code Available | 0
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning | Jun 11, 2025 | Image Captioning, Math | Code Available | 2
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs | Jun 11, 2025 | Code Generation, Diagnostic | Code Available | 1
A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning | Jun 11, 2025 | Decoder, Image Captioning | Unverified | 0
An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | Jun 10, 2025 | Action Generation, Image Captioning | Unverified | 0
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval | Jun 10, 2025 | Image Captioning, Retrieval | Code Available | 1
Edit Flows: Flow Matching with Edit Operations | Jun 10, 2025 | Code Generation, Image Captioning | Unverified | 0
Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings | Jun 10, 2025 | Image Captioning | Code Available | 0
Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring | Jun 10, 2025 | Image Captioning | Unverified | 0
GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition | Jun 9, 2025 | Image Captioning | Code Available | 0
Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning | Jun 8, 2025 | Attribute, Hallucination | Unverified | 0
Stepwise Decomposition and Dual-stream Focus: A Novel Approach for Training-free Camouflaged Object Segmentation | Jun 7, 2025 | Camouflaged Object Segmentation, Feature Correlation | Code Available | 0
SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs | Jun 5, 2025 | backdoor defense, Image Captioning | Unverified | 0
Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation | Jun 3, 2025 | Caption Generation, Image Captioning | Unverified | 0
Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models | May 30, 2025 | Image Captioning, Question Answering | Unverified | 0
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint | May 29, 2025 | Image Captioning, Question Answering | Code Available | 1
CLDTracker: A Comprehensive Language Description for Visual Tracking | May 29, 2025 | Image Captioning, Visual Tracking | Code Available | 0
Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport | May 29, 2025 | Document Level Machine Translation, Image Captioning | Code Available | 0
Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model | May 29, 2025 | Image Captioning, Language Modeling | Unverified | 0
Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain) | May 26, 2025 | Image Captioning | Code Available | 0
SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards | May 25, 2025 | Image Captioning, Multimodal Reasoning | Code Available | 1
TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP | May 24, 2025 | Image Captioning, Image Generation | Unverified | 0
Scaling Up Biomedical Vision-Language Models: Fine-Tuning, Instruction Tuning, and Multi-Modal Learning | May 23, 2025 | Decoder, Image Captioning | Code Available | 4
Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation | May 22, 2025 | Hallucination, Image Captioning | Unverified | 0
Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics | May 22, 2025 | Image Captioning, text similarity | Unverified | 0
SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval | May 21, 2025 | counterfactual, Graph Generation | Code Available | 0
MedBLIP: Fine-tuning BLIP for Medical Image Captioning | May 20, 2025 | Decoder, Image Captioning | Unverified | 0
NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI | May 20, 2025 | Anomaly Localization, Benchmarking | Unverified | 0
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | May 20, 2025 | Image Captioning, Question Answering | Code Available | 0
Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models | May 20, 2025 | Hallucination, Image Captioning | Unverified | 0
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping | May 19, 2025 | Contrastive Learning, Cross-Modal Retrieval | Unverified | 0
Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models | May 16, 2025 | Image Captioning, Question Answering | Code Available | 0
Cross-Image Contrastive Decoding: Precise, Lossless Suppression of Language Priors in Large Vision-Language Models | May 15, 2025 | Image Captioning, Language Modeling | Unverified | 0
A Grounded Memory System For Smart Personal Assistants | May 9, 2025 | Entity Disambiguation, Image Captioning | Unverified | 0
Describe Anything in Medical Images | May 9, 2025 | Attribute, Diagnostic | Unverified | 0
ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding | May 9, 2025 | Image Captioning, Object Recognition | Unverified | 0
Mitigating Image Captioning Hallucinations in Vision-Language Models | May 6, 2025 | Hallucination, Hallucination Evaluation | Unverified | 0
Compositional Image-Text Matching and Retrieval by Grounding Entities | May 4, 2025 | Image Captioning, Image-text matching | Code Available | 0
Transferable Adversarial Attacks on Black-Box Vision-Language Models | May 2, 2025 | Image Captioning, Object Recognition | Unverified | 0
Zoomer: Adaptive Image Focus Optimization for Black-box MLLM | Apr 30, 2025 | Image Captioning, Object Recognition | Unverified | 0
MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation | Apr 29, 2025 | cross-modal alignment, Decoder | Code Available | 0
Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning | Apr 21, 2025 | Image Captioning | Unverified | 0
Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0
Generalized Visual Relation Detection with Diffusion Models | Apr 16, 2025 | Graph Generation, Human-Object Interaction Detection | Unverified | 0
LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation | Apr 15, 2025 | Image Captioning, Question Answering | Unverified | 0
TADACap: Time-series Adaptive Domain-Aware Captioning | Apr 15, 2025 | Image Captioning, Retrieval | Unverified | 0
Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks | Apr 14, 2025 | Ethics, Fairness | Unverified | 0
SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging | Apr 14, 2025 | Anomaly Detection, Diagnostic | Code Available | 1