SOTAVerified

Image Description

Papers

Showing 1-50 of 154 papers

| Title | Status | Hype |
|---|---|---|
| Text-Visual Semantic Constrained AI-Generated Image Quality Assessment | Code | 1 |
| Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | Code | 1 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | Code | 2 |
| Advanced Chest X-Ray Analysis via Transformer-Based Image Descriptors and Cross-Model Attention Mechanism | | 0 |
| LaMOuR: Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning | | 0 |
| Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model | Code | 2 |
| VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models | Code | 0 |
| SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models | Code | 1 |
| Boli: A dataset for understanding stuttering experience and analyzing stuttered speech | | 0 |
| IDEA: Image Description Enhanced CLIP-Adapter | Code | 0 |
| A Preliminary Survey of Semantic Descriptive Model for Images | | 0 |
| Exploring the Use of Contrastive Language-Image Pre-Training for Human Posture Classification: Insights from Yoga Pose Analysis | | 0 |
| RRHF-V: Ranking Responses to Mitigate Hallucinations in Multimodal Large Language Models with Human Feedback | Code | 0 |
| Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis | | 0 |
| TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models | | 0 |
| MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps | Code | 0 |
| Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Code | 0 |
| Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions | Code | 2 |
| Language Augmentation in CLIP for Improved Anatomy Detection on Multi-modal Medical Images | | 0 |
| Data-augmented phrase-level alignment for mitigating object hallucination | | 0 |
| WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization | | 0 |
| MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets | Code | 0 |
| Artwork Explanation in Large-scale Vision Language Models | | 0 |
| A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models | | 0 |
| Can Large Multimodal Models Uncover Deep Semantics Behind Images? | Code | 1 |
| Seeing the Unseen: Visual Common Sense for Semantic Placement | | 0 |
| InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models | | 0 |
| Localized Symbolic Knowledge Distillation for Visual Commonsense Models | Code | 0 |
| Impressions: Understanding Visual Semiotics and Aesthetic Impact | | 0 |
| Large Language Models can Share Images, Too! | Code | 0 |
| Towards image compression with perfect realism at ultra-low bitrates | Code | 1 |
| Bounding and Filling: A Fast and Flexible Framework for Image Captioning | Code | 0 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Code | 7 |
| ContextRef: Evaluating Referenceless Metrics For Image Description Generation | Code | 0 |
| A skeletonization algorithm for gradient-based optimization | Code | 1 |
| A Fine-Grained Image Description Generation Method Based on Joint Objectives | | 0 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Code | 5 |
| Chatting Makes Perfect: Chat-based Image Retrieval | Code | 1 |
| PandaGPT: One Model To Instruction-Follow Them All | Code | 2 |
| DiffCap: Exploring Continuous Diffusion on Image Captioning | | 0 |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | Code | 3 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Code | 7 |
| Fan-Beam Binarization Difference Projection (FB-BDP): A Novel Local Object Descriptor for Fine-Grained Leaf Image Retrieval | Code | 0 |
| DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset | Code | 1 |
| Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation | Code | 1 |
| Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval | Code | 0 |
| Facial Expression Recognition and Image Description Generation in Vietnamese | | 0 |
| Skeletal Human Action Recognition using Hybrid Attention based Graph Convolutional Network | Code | 0 |
| Image Description Dataset for Language Learners | | 0 |
| Multilingual Image Corpus – Towards a Multimodal and Multilingual Dataset | | 0 |

No leaderboard results yet.