
Visual Entailment

Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image, rather than a natural language sentence as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
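To make the task's input/output interface concrete, here is a minimal, hypothetical sketch. It assumes the three-way label set used by the SNLI-VE dataset (entailment, neutral, contradiction) and stands in for a real model with a toy cosine-similarity threshold over precomputed image and text embeddings; the `VEExample` class, `predict` function, and thresholds are illustrative inventions, not an API from any of the papers listed below.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

# Assumption: three-way labels as in the SNLI-VE dataset.
LABELS = ("entailment", "neutral", "contradiction")

@dataclass
class VEExample:
    image_embedding: List[float]   # the premise: an image (here, a precomputed embedding)
    hypothesis: str                # the natural-language hypothesis to verify
    label: Optional[str] = None    # gold label, if annotated

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict(image_emb: List[float], text_emb: List[float],
            hi: float = 0.6, lo: float = 0.2) -> str:
    """Toy stand-in for a VE classifier: threshold image-text similarity.

    A real system (e.g. a fine-tuned vision-language model) would learn this
    decision; the thresholds here are arbitrary and purely illustrative.
    """
    s = cosine(image_emb, text_emb)
    if s >= hi:
        return "entailment"
    if s <= lo:
        return "contradiction"
    return "neutral"
```

For example, identical embeddings yield `"entailment"`, orthogonal ones yield `"contradiction"`, and anything in between falls back to `"neutral"`; the point is only the shape of the task (image premise in, one of three labels out), not the scoring rule itself.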

Papers

Showing 21–30 of 56 papers

Title | Status | Hype
How Much Can CLIP Benefit Vision-and-Language Tasks? | Code | 1
Check It Again: Progressive Visual Question Answering via Visual Entailment | Code | 1
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | Code | 1
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1
UNITER: UNiversal Image-TExt Representation Learning | Code | 1
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks | (no code) | 0
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing | Code | 0
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks | (no code) | 0
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models | Code | 0
Lightweight In-Context Tuning for Multimodal Unified Models | (no code) | 0
Page 3 of 6

No leaderboard results yet.