SOTAVerified

Visual Entailment

Visual Entailment (VE) is a task built on image-sentence pairs in which the premise is an image rather than a natural language sentence, as it is in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
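To make the task format concrete, here is a minimal sketch of how a single VE example and an accuracy score might be represented. The three-way label set (entailment, neutral, contradiction) is an assumption borrowed from common textual entailment benchmarks; the description above does not fix the label set. The names `VEExample`, `image_path`, `hypothesis`, and `accuracy` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumed label set (not stated in the task description above).
LABELS = ("entailment", "neutral", "contradiction")


@dataclass(frozen=True)
class VEExample:
    """One image-sentence pair: the image is the premise, the text the hypothesis."""
    image_path: str   # hypothetical field: location of the premise image
    hypothesis: str   # the sentence to be judged against the image
    label: str        # gold label, one of LABELS

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def accuracy(examples: Sequence[VEExample], predictions: Sequence[str]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    if not examples:
        raise ValueError("cannot score an empty set of examples")
    correct = sum(ex.label == pred for ex, pred in zip(examples, predictions))
    return correct / len(examples)
```

A model for this task would take the image and the hypothesis as input and emit one of the labels; the sketch deliberately leaves the model out and covers only the data shape and the scoring step.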

Papers

Showing 1-10 of 56 papers

Title | Status | Hype
Distilled Dual-Encoder Model for Vision-Language Understanding | Code | 1
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1
CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization | Code | 1
Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning | Code | 1
Check It Again: Progressive Visual Question Answering via Visual Entailment | Code | 1
Check It Again: Progressive Visual Question Answering via Visual Entailment | Code | 1
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation | Code | 1
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Code | 1
Fine-Grained Visual Entailment | Code | 1

No leaderboard results yet.