SOTAVerified

Visual Entailment

Visual Entailment (VE) is a task consisting of image-sentence pairs in which the premise is an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
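As a rough illustration of the task setup, VE benchmarks in the SNLI-VE style pair an image premise with a text hypothesis under a three-way label scheme (entailment / neutral / contradiction). The sketch below is a minimal, hypothetical example — the image ids, hypotheses, and baseline are toy data, not drawn from any listed paper — showing the pair structure and a simple accuracy check:

```python
from dataclasses import dataclass

LABELS = ("entailment", "neutral", "contradiction")

@dataclass
class VEExample:
    image_id: str    # premise: an image, referenced here by a toy id
    hypothesis: str  # natural-language hypothesis about the image
    label: str       # gold label, one of LABELS

def accuracy(examples, predict):
    """Fraction of examples where predict(example) matches the gold label."""
    correct = sum(1 for ex in examples if predict(ex) == ex.label)
    return correct / len(examples)

# Toy data for illustration only (hypothetical image ids and hypotheses).
data = [
    VEExample("img_001", "A dog is running on grass.", "entailment"),
    VEExample("img_001", "The dog is chasing a ball.", "neutral"),
    VEExample("img_001", "A cat is sleeping indoors.", "contradiction"),
]

# Trivial majority-class baseline: always predict "entailment".
baseline = lambda ex: "entailment"
print(round(accuracy(data, baseline), 2))  # prints 0.33
```

A real VE model would replace `baseline` with a vision-language classifier that consumes the image pixels and hypothesis text; the evaluation loop stays the same.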

Papers

Showing 11–20 of 56 papers

Title | Status | Hype
CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization | Code | 1
Distilled Dual-Encoder Model for Vision-Language Understanding | Code | 1
Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning | Code | 1
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation | Code | 1
Fine-Grained Visual Entailment | Code | 1
Good Questions Help Zero-Shot Image Reasoning | Code | 1
MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion | Code | 1

No leaderboard results yet.