
Visual Entailment

Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural language sentence, unlike traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
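
VE is commonly formulated with three labels (entailment, neutral, contradiction). As a rough illustration of the setup, and of the zero-shot CLIP approach studied in the "CLIP Models are Few-shot Learners" paper listed below, the sketch below scores an image premise against a text hypothesis with an off-the-shelf CLIP model and maps the similarity to a label. The threshold values and the premise.jpg path are hypothetical placeholders, not values or methods taken from any listed paper.

```python
# Minimal sketch: zero-shot Visual Entailment scoring with a CLIP-style model.
# The similarity-to-label thresholds are illustrative placeholders only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

LABELS = ("entailment", "neutral", "contradiction")

def predict_ve(image: Image.Image, hypothesis: str,
               model: CLIPModel, processor: CLIPProcessor) -> str:
    """The premise is the image; the hypothesis is a sentence."""
    inputs = processor(text=[hypothesis], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image holds the scaled image-text cosine similarity.
        sim = model(**inputs).logits_per_image.item()
    if sim > 25.0:        # strong agreement -> entailment (placeholder cutoff)
        return LABELS[0]
    if sim > 18.0:        # weak agreement -> neutral (placeholder cutoff)
        return LABELS[1]
    return LABELS[2]      # low similarity -> contradiction

if __name__ == "__main__":
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = Image.open("premise.jpg")  # hypothetical local image premise
    print(predict_ve(image, "Two dogs are playing in the snow.", model, processor))
```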

Papers

Showing 41–50 of 56 papers

Title | Status | Hype
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering | - | 0
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks | - | 0
A survey on knowledge-enhanced multimodal learning | - | 0
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks | - | 0
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks | - | 0
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment | - | 0
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks | - | 0
Compound Tokens: Channel Fusion for Vision-Language Representation Learning | - | 0
Visual Entailment Task for Visually-Grounded Language Learning | Code | 0
Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages | Code | 0

No leaderboard results yet.