RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness | May 27, 2024 | Hallucination, Image Captioning | Code Available 11
Chameleon: Mixed-Modal Early-Fusion Foundation Models | May 16, 2024 | Image Captioning, Image Generation | Code Available 7
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | Dec 1, 2023 | Hallucination, Image Captioning | Code Available 6
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Nov 15, 2022 | All, Disentanglement | Code Available 6
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | Mar 7, 2024 | 4k, Image Captioning | Code Available 5
YOLOR-Based Multi-Task Learning | Sep 29, 2023 | Image Captioning, Instance Segmentation | Code Available 5
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering, FS-MEVQA | Code Available 5
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Jan 28, 2022 | Image Captioning, Image-text matching | Code Available 5
Scaling Up Biomedical Vision-Language Models: Fine-Tuning, Instruction Tuning, and Multi-Modal Learning | May 23, 2025 | Decoder, Image Captioning | Code Available 4
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation | Nov 7, 2024 | Contrastive Learning, Image Captioning | Code Available 4
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models | Sep 25, 2024 | Image Captioning | Code Available 4
A Survey on Vision-Language-Action Models for Embodied AI | May 23, 2024 | Image Captioning, Instruction Following | Code Available 4
GPT-4V(ision) is a Generalist Web Agent, if Grounded | Jan 3, 2024 | Image Captioning, Question Answering | Code Available 4
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering, Image Captioning | Code Available 4
GLIPv2: Unifying Localization and Vision-Language Understanding | Jun 12, 2022 | 2D Object Detection, Contrastive Learning | Code Available 4
Falcon: A Remote Sensing Vision-Language Foundation Model | Mar 14, 2025 | Image Captioning, image-classification | Code Available 3
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Feb 9, 2025 | Image Captioning, Image-text Retrieval | Code Available 3
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Jan 10, 2025 | Image Captioning, Language Modeling | Code Available 3
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion | Dec 5, 2024 | Contrastive Learning, Hallucination | Code Available 3
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey | Dec 3, 2024 | Change Detection, Descriptive | Code Available 3
Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models | Sep 16, 2024 | Decoder, Diversity | Code Available 3
View Selection for 3D Captioning via Diffusion Ranking | Apr 11, 2024 | 3D Object Captioning, Hallucination | Code Available 3
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | Dec 28, 2023 | Computational Efficiency, Image Captioning | Code Available 3
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Nov 11, 2023 | Image Captioning, MMR total | Code Available 3
Emu: Generative Pretraining in Multimodality | Jul 11, 2023 | Image Captioning, Image Generation | Code Available 3
SVIT: Scaling up Visual Instruction Tuning | Jul 9, 2023 | Diversity, Image Captioning | Code Available 3
WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset | May 9, 2023 | Articles, Image Captioning | Code Available 3
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | May 4, 2023 | controllable image captioning, Image Captioning | Code Available 3
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning, Image Captioning | Code Available 3
All You May Need for VQA are Image Captions | May 4, 2022 | All, Image Captioning | Code Available 3
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models | Feb 8, 2022 | Diagnostic, Image Captioning | Code Available 3
Ludwig: a type-based declarative deep learning toolbox | Sep 17, 2019 | Decoder, Deep Learning | Code Available 3
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning | Jun 11, 2025 | Image Captioning, Math | Code Available 2
OmniCaptioner: One Captioner to Rule Them All | Apr 9, 2025 | All, Image Captioning | Code Available 2
Unified Multimodal Discrete Diffusion | Mar 26, 2025 | Image Captioning, Image Generation | Code Available 2
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Mar 6, 2025 | General Knowledge, Image Captioning | Code Available 2
Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts | Feb 24, 2025 | Benchmarking, Fact Verification | Code Available 2
EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation | Dec 24, 2024 | Image Captioning, Image Generation | Code Available 2
Frontiers in Intelligent Colonoscopy | Oct 22, 2024 | Image Captioning | Code Available 2
TIPS: Text-Image Pretraining with Spatial Awareness | Oct 21, 2024 | Depth Estimation, Image Captioning | Code Available 2
RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models | Oct 17, 2024 | Image Captioning, Question Answering | Code Available 2
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Jun 18, 2024 | Image Captioning, Question Answering | Code Available 2
Towards Vision-Language Geo-Foundation Model: A Survey | Jun 13, 2024 | Earth Observation, Image Captioning | Code Available 2
Yo'LLaVA: Your Personalized Language and Vision Assistant | Jun 13, 2024 | Image Captioning, Question Answering | Code Available 2
From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks | Jun 4, 2024 | Image Captioning, Language Modelling | Code Available 2
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Jun 3, 2024 | Image Captioning, Language Modelling | Code Available 2
Benchmarking and Improving Detail Image Caption | May 29, 2024 | Benchmarking, Image Captioning | Code Available 2
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | May 9, 2024 | Image Captioning, Instruction Following | Code Available 2
OmniSearchSage: Multi-Task Multi-Entity Embeddings for Pinterest Search | Apr 25, 2024 | Entity Embeddings, Image Captioning | Code Available 2
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching | Apr 4, 2024 | Attribute, Image Captioning | Code Available 2