RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness | May 27, 2024 | Hallucination, Image Captioning | Code available (11)
Chameleon: Mixed-Modal Early-Fusion Foundation Models | May 16, 2024 | Image Captioning, Image Generation | Code available (7)
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | Dec 1, 2023 | Hallucination, Image Captioning | Code available (6)
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Nov 15, 2022 | All, Disentanglement | Code available (6)
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | Mar 7, 2024 | 4k, Image Captioning | Code available (5)
YOLOR-Based Multi-Task Learning | Sep 29, 2023 | Image Captioning, Instance Segmentation | Code available (5)
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering, FS-MEVQA | Code available (5)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Jan 28, 2022 | Image Captioning, Image-text matching | Code available (5)
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models | Sep 25, 2024 | Image Captioning | Code available (4)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering, Image Captioning | Code available (4)
GLIPv2: Unifying Localization and Vision-Language Understanding | Jun 12, 2022 | 2D Object Detection, Contrastive Learning | Code available (4)
GPT-4V(ision) is a Generalist Web Agent, if Grounded | Jan 3, 2024 | Image Captioning, Question Answering | Code available (4)
A Survey on Vision-Language-Action Models for Embodied AI | May 23, 2024 | Image Captioning, Instruction Following | Code available (4)
Scaling Up Biomedical Vision-Language Models: Fine-Tuning, Instruction Tuning, and Multi-Modal Learning | May 23, 2025 | Decoder, Image Captioning | Code available (4)
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation | Nov 7, 2024 | Contrastive Learning, Image Captioning | Code available (4)
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Nov 11, 2023 | Image Captioning, MMR total | Code available (3)
Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models | Sep 16, 2024 | Decoder, Diversity | Code available (3)
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models | Feb 8, 2022 | Diagnostic, Image Captioning | Code available (3)
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion | Dec 5, 2024 | Contrastive Learning, Hallucination | Code available (3)
View Selection for 3D Captioning via Diffusion Ranking | Apr 11, 2024 | 3D Object Captioning, Hallucination | Code available (3)
SVIT: Scaling up Visual Instruction Tuning | Jul 9, 2023 | Diversity, Image Captioning | Code available (3)
Ludwig: a type-based declarative deep learning toolbox | Sep 17, 2019 | Decoder, Deep Learning | Code available (3)
Emu: Generative Pretraining in Multimodality | Jul 11, 2023 | Image Captioning, Image Generation | Code available (3)
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning, Image Captioning | Code available (3)
Falcon: A Remote Sensing Vision-Language Foundation Model | Mar 14, 2025 | Image Captioning, image-classification | Code available (3)
WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset | May 9, 2023 | Articles, Image Captioning | Code available (3)
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Feb 9, 2025 | Image Captioning, Image-text Retrieval | Code available (3)
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | Dec 28, 2023 | Computational Efficiency, Image Captioning | Code available (3)
All You May Need for VQA are Image Captions | May 4, 2022 | All, Image Captioning | Code available (3)
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | May 4, 2023 | controllable image captioning, Image Captioning | Code available (3)
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey | Dec 3, 2024 | Change Detection, Descriptive | Code available (3)
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Jan 10, 2025 | Image Captioning, Language Modeling | Code available (3)
MeaCap: Memory-Augmented Zero-shot Image Captioning | Mar 6, 2024 | Caption Generation, Image Captioning | Code available (2)
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | Oct 3, 2023 | Chatbot, Image Captioning | Code available (2)
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Jun 15, 2023 | Hallucination, Image Captioning | Code available (2)
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Jun 29, 2023 | 16k, Image Captioning | Code available (2)
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Nov 28, 2023 | Image Captioning, Question Answering | Code available (2)
Language Models Can See: Plugging Visual Controls in Text Generation | May 5, 2022 | Image Captioning, Image-text matching | Code available (2)
A Better Variant of Self-Critical Sequence Training | Mar 22, 2020 | Image Captioning | Code available (2)
BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks | May 26, 2023 | Image Captioning, Medical Visual Question Answering | Code available (2)
Learning Vision from Models Rivals Learning Vision from Data | Dec 28, 2023 | Contrastive Learning, Image Captioning | Code available (2)
OmniCaptioner: One Captioner to Rule Them All | Apr 9, 2025 | All, Image Captioning | Code available (2)
Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts | Feb 24, 2025 | Benchmarking, Fact Verification | Code available (2)
Benchmarking and Improving Detail Image Caption | May 29, 2024 | Benchmarking, Image Captioning | Code available (2)
GIT: A Generative Image-to-text Transformer for Vision and Language | May 27, 2022 | Decoder, Image Captioning | Code available (2)
GLaMM: Pixel Grounding Large Multimodal Model | Nov 6, 2023 | Conversational Question Answering, Image Captioning | Code available (2)
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models | Oct 13, 2023 | Hallucination, Image Captioning | Code available (2)
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension | Mar 12, 2024 | Deblurring, Decoder | Code available (2)
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Mar 6, 2025 | General Knowledge, Image Captioning | Code available (2)
From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks | Jun 4, 2024 | Image Captioning, Language Modelling | Code available (2)