VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis Mar 29, 2024 Hallucination Image Captioning
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction Mar 27, 2024 Image Captioning Language Modeling
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning Mar 19, 2024 Benchmarking Image Captioning
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension Mar 12, 2024 Deblurring Decoder
MeaCap: Memory-Augmented Zero-shot Image Captioning Mar 6, 2024 Caption Generation Image Captioning
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT Mar 4, 2024 Image Captioning Zero-shot Moment Retrieval
Learning Vision from Models Rivals Learning Vision from Data Dec 28, 2023 Contrastive Learning Image Captioning
VCoder: Versatile Vision Encoders for Multimodal Large Language Models Dec 21, 2023 Image Captioning Image Generation
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models Nov 28, 2023 Image Captioning Question Answering
GLaMM: Pixel Grounding Large Multimodal Model Nov 6, 2023 Conversational Question Answering Image Captioning
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models Oct 13, 2023 Hallucination Image Captioning
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts Oct 3, 2023 Chatbot Image Captioning
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions Aug 8, 2023 Caption Generation Image Captioning
JourneyDB: A Benchmark for Generative Image Understanding Jul 3, 2023 Image Captioning Image Comprehension
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding Jun 29, 2023 16k Image Captioning
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic Jun 27, 2023 Image Captioning Referring Expression Segmentation
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models Jun 15, 2023 Hallucination Image Captioning
Scalable 3D Captioning with Pretrained Models Jun 12, 2023 Descriptive Image Captioning
Contextual Object Detection with Multimodal Large Language Models May 29, 2023 Cloze Test Decoder
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset May 29, 2023 Audio captioning Audio-Visual Captioning
BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks May 26, 2023 Image Captioning Medical Visual Question Answering
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset Apr 17, 2023 Audio captioning Audio-Video Question Answering (AVQA)
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions Mar 12, 2023 Image Captioning Question Answering
Semantic-Conditional Diffusion Networks for Image Captioning Dec 6, 2022 Cross-Modal Retrieval Decoder
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks Nov 22, 2022 All Cross-Modal Retrieval
Text-Only Training for Image Captioning using Noise-Injected CLIP Nov 1, 2022 Decoder Image Captioning
PoseScript: Linking 3D Human Poses and Natural Language Oct 21, 2022 Cross-Modal Retrieval Image Captioning
Visual Language Maps for Robot Navigation Oct 11, 2022 3D Reconstruction Image Captioning
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP Oct 9, 2022 Image Captioning Open Vocabulary Semantic Segmentation
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding Oct 7, 2022 Chart Question Answering Diversity
Comprehending and Ordering Semantics for Image Captioning Jun 14, 2022 Cross-Modal Retrieval Image Captioning
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs Jun 9, 2022 Image Captioning Image Classification
GIT: A Generative Image-to-text Transformer for Vision and Language May 27, 2022 Decoder Image Captioning
Fine-grained Image Captioning with CLIP Reward May 26, 2022 Caption Generation Descriptive
Language Models Can See: Plugging Visual Controls in Text Generation May 5, 2022 Image Captioning Image-text matching
ClipCap: CLIP Prefix for Image Captioning Nov 18, 2021 Image Captioning Language Modeling
VinVL: Revisiting Visual Representations in Vision-Language Models Jan 2, 2021 Image Captioning Image-text matching
Controlling Length in Image Captioning May 29, 2020 Image Captioning
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks Apr 13, 2020 Cross-Modal Retrieval Image Captioning
A Better Variant of Self-Critical Sequence Training Mar 22, 2020 Image Captioning
Unified Vision-Language Pre-Training for Image Captioning and VQA Sep 24, 2019 Decoder Image Captioning
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs Jun 11, 2025 Code Generation Diagnostic
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval Jun 10, 2025 Image Captioning Retrieval
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint May 29, 2025 Image Captioning Question Answering
SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards May 25, 2025 Image Captioning Multimodal Reasoning
SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging Apr 14, 2025 Anomaly Detection Diagnostic
A Survey on Efficient Vision-Language Models Apr 13, 2025 Image Captioning Question Answering
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models Mar 25, 2025 Benchmarking Image Captioning
Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives Mar 18, 2025 Image Captioning
Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition Mar 16, 2025 Caption Generation Image Captioning