SOTAVerified

Image Description

Papers

Showing 150 of 154 papers

Title | Status | Hype
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Code | 7
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Code | 7
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Code | 5
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | Code | 3
PandaGPT: One Model To Instruction-Follow Them All | Code | 2
Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | Code | 2
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model | Code | 2
Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions | Code | 2
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | Code | 1
A skeletonization algorithm for gradient-based optimization | Code | 1
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Code | 1
DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset | Code | 1
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation | Code | 1
Zero-Shot Out-of-Distribution Detection Based on the Pre-trained Model CLIP | Code | 1
Revisiting Binary Local Image Description for Resource Limited Devices | Code | 1
Towards image compression with perfect realism at ultra-low bitrates | Code | 1
Grounded Video Description | Code | 1
Text-Visual Semantic Constrained AI-Generated Image Quality Assessment | Code | 1
Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | Code | 1
Chatting Makes Perfect: Chat-based Image Retrieval | Code | 1
CIDEr: Consensus-based Image Description Evaluation | Code | 1
Can Large Multimodal Models Uncover Deep Semantics Behind Images? | Code | 1
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Code | 1
SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models | Code | 1
Focused Evaluation for Image Description with Binary Forced-Choice Tasks | - | 0
Computer Vision and Conflicting Values: Describing People with Automated Alt Text | - | 0
A Fine-Grained Image Description Generation Method Based on Joint Objectives | - | 0
A Genetic Algorithm Approach for Image Representation Learning through Color Quantization | - | 0
A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching | - | 0
From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning | - | 0
Comparing Automatic Evaluation Measures for Image Description | - | 0
Collecting Image Description Datasets using Crowdsourcing | - | 0
Advanced Chest X-Ray Analysis via Transformer-Based Image Descriptors and Cross-Model Attention Mechanism | - | 0
Doubly-Attentive Decoder for Multi-modal Neural Machine Translation | - | 0
A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models | - | 0
Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description | - | 0
Generating Image Captions in Arabic using Root-Word Based Recurrent Neural Networks and Deep Neural Networks | - | 0
Exploring the Use of Contrastive Language-Image Pre-Training for Human Posture Classification: Insights from Yoga Pose Analysis | - | 0
Artwork Explanation in Large-scale Vision Language Models | - | 0
Exploring Visual Relationship for Image Captioning | - | 0
DiffCap: Exploring Continuous Diffusion on Image Captioning | - | 0
DIDEC: The Dutch Image Description and Eye-tracking Corpus | - | 0
A Preliminary Survey of Semantic Descriptive Model for Images | - | 0
Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space | - | 0
Adding the Third Dimension to Spatial Relation Detection in 2D Images | - | 0
Don't Mention the Shoe! A Learning to Rank Approach to Content Selection for Image Description Generation | - | 0
Exploring the Behavior of Classic REG Algorithms in the Description of Characters in 3D Images | - | 0
Draw and Tell: Multimodal Descriptions Outperform Verbal- or Sketch-Only Descriptions in an Image Retrieval Task | - | 0
A Shared Task on Multimodal Machine Translation and Crosslingual Image Description | - | 0
Face2Text revisited: Improved data set and baseline results | - | 0

No leaderboard results yet.