
Visual Entailment

Visual Entailment (VE) is an image-sentence pair task in which the premise is an image rather than a natural-language sentence, as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the hypothesis text.
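The task definition above can be sketched as a simple decision over an image-text pair. Real VE systems learn a classifier over fused image and text features; the following is only a minimal illustration, where the `VEExample` structure, the `classify` helper, and its similarity thresholds are all hypothetical, and the three labels follow the common VE formulation (entailment / neutral / contradiction).

```python
from dataclasses import dataclass

@dataclass
class VEExample:
    """One Visual Entailment instance (hypothetical structure)."""
    image_premise: str  # path or ID of the premise image
    hypothesis: str     # natural-language hypothesis sentence
    label: str          # "entailment", "neutral", or "contradiction"

def classify(similarity: float, hi: float = 0.6, lo: float = 0.3) -> str:
    """Toy rule: map an image-text similarity score to a VE label.

    The thresholds are illustrative only; actual models predict the
    label with a learned classification head, not fixed cutoffs.
    """
    if similarity >= hi:
        return "entailment"
    if similarity >= lo:
        return "neutral"
    return "contradiction"

# Example: a high similarity score maps to "entailment" under this toy rule.
example = VEExample("dog_in_park.jpg", "A dog is playing outside.", classify(0.8))
```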

Papers

Showing 21–30 of 56 papers

Title | Status | Hype
A survey on knowledge-enhanced multimodal learning | — | 0
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Code | 1
AlignVE: Visual Entailment Recognition Based on Alignment Relations | — | 0
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Code | 1
Pre-training image-language transformers for open-vocabulary tasks | — | 0
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1
Prompt Tuning for Generative Multimodal Pretrained Models | Code | 0
Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations | Code | 0
MixGen: A New Multi-Modal Data Augmentation | Code | 1
CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1
Page 3 of 6

No leaderboard results yet.