RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness · May 27, 2024 · Hallucination, Image Captioning · Code Available 115
Chameleon: Mixed-Modal Early-Fusion Foundation Models · May 16, 2024 · Image Captioning, Image Generation · Code Available 75
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback · Dec 1, 2023 · Hallucination, Image Captioning · Code Available 65
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model · Nov 15, 2022 · All, Disentanglement · Code Available 65
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · Jan 28, 2022 · Image Captioning, Image-text matching · Code Available 55
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation · Mar 7, 2024 · 4k, Image Captioning · Code Available 55
YOLOR-Based Multi-Task Learning · Sep 29, 2023 · Image Captioning, Instance Segmentation · Code Available 55
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond · Aug 24, 2023 · Chart Question Answering, FS-MEVQA · Code Available 55
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models · Sep 25, 2024 · Image Captioning · Code Available 45
GLIPv2: Unifying Localization and Vision-Language Understanding · Jun 12, 2022 · 2D Object Detection, Contrastive Learning · Code Available 45
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation · Nov 7, 2024 · Contrastive Learning, Image Captioning · Code Available 45
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models · Jan 30, 2023 · Generative Visual Question Answering, Image Captioning · Code Available 45
GPT-4V(ision) is a Generalist Web Agent, if Grounded · Jan 3, 2024 · Image Captioning, Question Answering · Code Available 45
A Survey on Vision-Language-Action Models for Embodied AI · May 23, 2024 · Image Captioning, Instruction Following · Code Available 45
Scaling Up Biomedical Vision-Language Models: Fine-Tuning, Instruction Tuning, and Multi-Modal Learning · May 23, 2025 · Decoder, Image Captioning · Code Available 45
Ludwig: a type-based declarative deep learning toolbox · Sep 17, 2019 · Decoder, Deep Learning · Code Available 35
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models · Nov 11, 2023 · Image Captioning, MMR total · Code Available 35
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models · Feb 8, 2022 · Diagnostic, Image Captioning · Code Available 35
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion · Dec 5, 2024 · Contrastive Learning, Hallucination · Code Available 35
Falcon: A Remote Sensing Vision-Language Foundation Model · Mar 14, 2025 · Image Captioning, image-classification · Code Available 35
Emu: Generative Pretraining in Multimodality · Jul 11, 2023 · Image Captioning, Image Generation · Code Available 35
WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset · May 9, 2023 · Articles, Image Captioning · Code Available 35
View Selection for 3D Captioning via Diffusion Ranking · Apr 11, 2024 · 3D Object Captioning, Hallucination · Code Available 35
All You May Need for VQA are Image Captions · May 4, 2022 · All, Image Captioning · Code Available 35
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends · Oct 17, 2022 · Few-Shot Learning, Image Captioning · Code Available 35
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding · Feb 9, 2025 · Image Captioning, Image-text Retrieval · Code Available 35
SVIT: Scaling up Visual Instruction Tuning · Jul 9, 2023 · Diversity, Image Captioning · Code Available 35
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones · Dec 28, 2023 · Computational Efficiency, Image Captioning · Code Available 35
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey · Dec 3, 2024 · Change Detection, Descriptive · Code Available 35
Caption Anything: Interactive Image Description with Diverse Multimodal Controls · May 4, 2023 · controllable image captioning, Image Captioning · Code Available 35
Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models · Sep 16, 2024 · Decoder, Diversity · Code Available 35
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design · Jan 10, 2025 · Image Captioning, Language Modeling · Code Available 35
MeaCap: Memory-Augmented Zero-shot Image Captioning · Mar 6, 2024 · Caption Generation, Image Captioning · Code Available 25
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts · Oct 3, 2023 · Chatbot, Image Captioning · Code Available 25
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models · Jun 15, 2023 · Hallucination, Image Captioning · Code Available 25
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models · Nov 28, 2023 · Image Captioning, Question Answering · Code Available 25
BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks · May 26, 2023 · Image Captioning, Medical Visual Question Answering · Code Available 25
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding · Jun 29, 2023 · 16k, Image Captioning · Code Available 25
Language Models Can See: Plugging Visual Controls in Text Generation · May 5, 2022 · Image Captioning, Image-text matching · Code Available 25
A Better Variant of Self-Critical Sequence Training · Mar 22, 2020 · Image Captioning · Code Available 25
JourneyDB: A Benchmark for Generative Image Understanding · Jul 3, 2023 · Image Captioning, Image Comprehension · Code Available 25
Learning Vision from Models Rivals Learning Vision from Data · Dec 28, 2023 · Contrastive Learning, Image Captioning · Code Available 25
OmniCaptioner: One Captioner to Rule Them All · Apr 9, 2025 · All, Image Captioning · Code Available 25
GIT: A Generative Image-to-text Transformer for Vision and Language · May 27, 2022 · Decoder, Image Captioning · Code Available 25
GLaMM: Pixel Grounding Large Multimodal Model · Nov 6, 2023 · Conversational Question Answering, Image Captioning · Code Available 25
Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts · Feb 24, 2025 · Benchmarking, Fact Verification · Code Available 25
From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks · Jun 4, 2024 · Image Captioning, Language Modelling · Code Available 25
Frontiers in Intelligent Colonoscopy · Oct 22, 2024 · Image Captioning · Code Available 25
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model · Mar 6, 2025 · General Knowledge, Image Captioning · Code Available 25
Fine-grained Image Captioning with CLIP Reward · May 26, 2022 · Caption Generation, Descriptive · Code Available 25