
Visual Entailment

Visual Entailment (VE) is a task consisting of image-sentence pairs in which the premise is an image rather than a natural language sentence, as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
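As a minimal sketch of the data format (field and class names here are illustrative, not from any specific dataset loader), a VE instance pairs an image premise with a text hypothesis and one of the three standard labels used in SNLI-VE-style Visual Entailment:

```python
from dataclasses import dataclass

# Three-way label set used in SNLI-VE-style Visual Entailment.
LABELS = ("entailment", "neutral", "contradiction")

@dataclass
class VEExample:
    """One Visual Entailment instance: image premise + text hypothesis."""
    image_path: str   # premise is an image, not a sentence
    hypothesis: str   # natural-language hypothesis to verify against the image
    label: str        # gold label, one of LABELS

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")

# A hypothetical example pair; a model would read the image and predict the label.
ex = VEExample("photo_001.jpg", "Two dogs are playing in the snow.", "entailment")
print(ex.label)
```

A VE model is then a three-way classifier over such pairs, scored by label accuracy.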

Papers

Showing 21-30 of 56 papers

Title | Status | Hype
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks | Code | 1
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation | Code | 1
UNITER: UNiversal Image-TExt Representation Learning | Code | 1
Understanding Figurative Meaning through Explainable Visual Entailment | Code | 1
Visual Spatial Reasoning | Code | 1
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation | - | 0
"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning | - | 0
Lightweight In-Context Tuning for Multimodal Unified Models | - | 0
UNITER: Learning UNiversal Image-TExt Representations | - | 0
Logically at Factify 2022: Multimodal Fact Verification | - | 0
Page 3 of 6

No leaderboard results yet.