
Visual Entailment

Visual Entailment (VE) is a task consisting of image-sentence pairs in which the premise is an image rather than a natural-language sentence, as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
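To make the setup concrete, here is a minimal sketch of what one VE example looks like as data. The class and field names are hypothetical; the three-way label set (entailment / neutral / contradiction) follows the common SNLI-VE formulation of the task.

```python
from dataclasses import dataclass

# The common SNLI-VE formulation uses three classes.
LABELS = ("entailment", "neutral", "contradiction")

@dataclass
class VEExample:
    """One Visual Entailment example (hypothetical structure).

    The premise is an image (referenced here by path), the hypothesis is a
    natural-language sentence, and the model must predict whether the image
    entails, is neutral toward, or contradicts the sentence.
    """
    premise_image: str  # path or identifier of the premise image
    hypothesis: str     # sentence to verify against the image
    label: str          # gold label, one of LABELS

ex = VEExample(
    premise_image="images/dog_on_beach.jpg",  # illustrative path
    hypothesis="A dog is playing outdoors.",
    label="entailment",
)
assert ex.label in LABELS
```

A VE model consumes `(premise_image, hypothesis)` pairs and is evaluated by classification accuracy over the three labels.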

Papers

Showing 31-40 of 56 papers

Title | Status | Hype
Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages | Code | 0
"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning | - | 0
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning | - | 0
Few-shot Multimodal Multitask Multilingual Learning | - | 0
Compound Tokens: Channel Fusion for Vision-Language Representation Learning | - | 0
A survey on knowledge-enhanced multimodal learning | - | 0
AlignVE: Visual Entailment Recognition Based on Alignment Relations | - | 0
Pre-training image-language transformers for open-vocabulary tasks | - | 0
Prompt Tuning for Generative Multimodal Pretrained Models | Code | 0
Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations | Code | 0
Page 4 of 6

No leaderboard results yet.