Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic Jun 27, 2023 Image Captioning Referring Expression Segmentation
Code Code Available 25 Scalable 3D Captioning with Pretrained Models Jun 12, 2023 Descriptive Image Captioning
Code Code Available 25 Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning Jun 11, 2025 Image Captioning Math
Code Code Available 25 Benchmarking and Improving Detail Image Caption May 29, 2024 Benchmarking Image Captioning
Code Code Available 25 Controlling Length in Image Captioning May 29, 2020 Image Captioning
Code Code Available 25 Comprehending and Ordering Semantics for Image Captioning Jun 14, 2022 Cross-Modal Retrieval Image Captioning
Code Code Available 25 RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models Oct 17, 2024 Image Captioning Question Answering
Code Code Available 25 Yo'LLaVA: Your Personalized Language and Vision Assistant Jun 13, 2024 Image Captioning Question Answering
Code Code Available 25 Unified Multimodal Discrete Diffusion Mar 26, 2025 Image Captioning Image Generation
Code Code Available 25 Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts Feb 24, 2025 Benchmarking Fact Verification
Code Code Available 25 Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding Oct 7, 2022 Chart Question Answering Diversity
Code Code Available 25 CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching Apr 4, 2024 Attribute Image Captioning
Code Code Available 25 Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks Apr 13, 2020 Cross-Modal Retrieval Image Captioning
Code Code Available 25 OmniCaptioner: One Captioner to Rule Them All Apr 9, 2025 All Image Captioning
Code Code Available 25 MeaCap: Memory-Augmented Zero-shot Image Captioning Mar 6, 2024 Caption Generation Image Captioning
Code Code Available 25 OmniSearchSage: Multi-Task Multi-Entity Embeddings for Pinterest Search Apr 25, 2024 Entity Embeddings Image Captioning
Code Code Available 25 PoseScript: Linking 3D Human Poses and Natural Language Oct 21, 2022 Cross-Modal Retrieval Image Captioning
Code Code Available 25 Text-Only Training for Image Captioning using Noise-Injected CLIP Nov 1, 2022 Decoder Image Captioning
Code Code Available 25 LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models Nov 28, 2023 Image Captioning Question Answering
Code Code Available 25 Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model Mar 6, 2025 General Knowledge Image Captioning
Code Code Available 25 LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding Jun 29, 2023 16k Image Captioning
Code Code Available 25 Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction Mar 27, 2024 Image Captioning Language Modeling
Code Code Available 25 VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis Mar 29, 2024 Hallucination Image Captioning
Code Code Available 25 GLaMM: Pixel Grounding Large Multimodal Model Nov 6, 2023 Conversational Question Answering Image Captioning
Code Code Available 25 JourneyDB: A Benchmark for Generative Image Understanding Jul 3, 2023 Image Captioning Image Comprehension
Code Code Available 25 ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions Mar 12, 2023 Image Captioning Question Answering
Code Code Available 25 Language Models Can See: Plugging Visual Controls in Text Generation May 5, 2022 Image Captioning Image-text matching
Code Code Available 25 Learning Vision from Models Rivals Learning Vision from Data Dec 28, 2023 Contrastive Learning Image Captioning
Code Code Available 25 LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models Jun 15, 2023 Hallucination Image Captioning
Code Code Available 25 MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts Oct 3, 2023 Chatbot Image Captioning
Code Code Available 25 From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks Jun 4, 2024 Image Captioning Language Modelling
Code Code Available 25 From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models Oct 13, 2023 Hallucination Image Captioning
Code Code Available 25 Frontiers in Intelligent Colonoscopy Oct 22, 2024 Image Captioning
Code Code Available 25 Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP Oct 9, 2022 Image Captioning Open Vocabulary Semantic Segmentation
Code Code Available 25 Fine-grained Image Captioning with CLIP Reward May 26, 2022 Caption Generation Descriptive
Code Code Available 25 BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks May 26, 2023 Image Captioning Medical Visual Question Answering
Code Code Available 25 Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models Jun 3, 2024 Image Captioning Language Modelling
Code Code Available 25 ClipCap: CLIP Prefix for Image Captioning Nov 18, 2021 Image Captioning Language Modeling
Code Code Available 25 Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions Aug 8, 2023 Caption Generation Image Captioning
Code Code Available 25 EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation Dec 24, 2024 Image Captioning Image Generation
Code Code Available 25 GIT: A Generative Image-to-text Transformer for Vision and Language May 27, 2022 Decoder Image Captioning
Code Code Available 25 CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features May 13, 2019 Domain Generalization Image Captioning
Code Code Available 15 Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model Sep 20, 2024 Image Captioning Panoptic Segmentation
Code Code Available 15 Aesthetically Relevant Image Captioning Nov 25, 2022 Image Captioning Sentence
Code Code Available 15 Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand Dec 8, 2021 Image Captioning Machine Translation
Code Code Available 15 DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training Mar 6, 2023 Decoder Image Captioning
Code Code Available 15 Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation Sep 12, 2023 Image Captioning Image Generation
Code Code Available 15 Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning May 9, 2022 Image Captioning Object
Code Code Available 15 Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model Aug 2, 2023 Hallucination Image Captioning
Code Code Available 15 BERTGEN: Multi-task Generation through BERT Jun 7, 2021 Decoder Image Captioning
Code Code Available 15