SOTAVerified

Image Description

Papers

Showing 150 of 154 papers

Title | Status | Hype
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Code | 7
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Code | 7
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Code | 5
Caption Anything: Interactive Image Description with Diverse Multimodal Controls | Code | 3
Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | Code | 2
Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions | Code | 2
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model | Code | 2
PandaGPT: One Model To Instruction-Follow Them All | Code | 2
Revisiting Binary Local Image Description for Resource Limited Devices | Code | 1
A skeletonization algorithm for gradient-based optimization | Code | 1
CIDEr: Consensus-based Image Description Evaluation | Code | 1
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation | Code | 1
Chatting Makes Perfect: Chat-based Image Retrieval | Code | 1
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Code | 1
Towards image compression with perfect realism at ultra-low bitrates | Code | 1
Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | Code | 1
SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models | Code | 1
Zero-Shot Out-of-Distribution Detection Based on the Pre-trained Model CLIP | Code | 1
DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset | Code | 1
Can Large Multimodal Models Uncover Deep Semantics Behind Images? | Code | 1
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Code | 1
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | Code | 1
Text-Visual Semantic Constrained AI-Generated Image Quality Assessment | Code | 1
Grounded Video Description | Code | 1
ContextRef: Evaluating Referenceless Metrics For Image Description Generation | Code | 0
Human Attention in Image Captioning: Dataset and Analysis | Code | 0
Compositional Obverter Communication Learning From Raw Visual Input | Code | 0
Pragmatic factors in image description: the case of negations | Code | 0
Multimodal Word Sense Disambiguation in Creative Practice | Code | 0
Contextualize, Show and Tell: A Neural Visual Storyteller | Code | 0
On Architectures for Including Visual Information in Neural Language Models for Image Description | Code | 0
CIDEr-R: Robust Consensus-based Image Description Evaluation | Code | 0
Multi30K: Multilingual English-German Image Descriptions | Code | 0
Multilingual Image Description with Neural Sequence Models | Code | 0
Room for improvement in automatic image description: an error analysis | Code | 0
Measuring the Diversity of Automatic Image Descriptions | Code | 0
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets | Code | 0
MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps | Code | 0
Localized Symbolic Knowledge Distillation for Visual Commonsense Models | Code | 0
Long-term Recurrent Convolutional Networks for Visual Recognition and Description | Code | 0
Describing Videos by Exploiting Temporal Structure | Code | 0
Bridging Languages through Images with Deep Partial Canonical Correlation Analysis | Code | 0
Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval | Code | 0
Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Code | 0
Deep Imbalanced Attribute Classification using Visual Attention Aggregation | Code | 0
Does Multimodality Help Human and Machine for Translation and Image Captioning? | Code | 0
Bounding and Filling: A Fast and Flexible Framework for Image Captioning | Code | 0
IDEA: Image Description Enhanced CLIP-Adapter | Code | 0
Efficient Decentralized Visual Place Recognition From Full-Image Descriptors | Code | 0
Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze | Code | 0

No leaderboard results yet.