
Visual Entailment

Visual Entailment (VE) is a task consisting of image-sentence pairs in which the premise is an image rather than a natural language sentence, as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
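The papers listed below use dedicated vision-language models for this task; purely as an illustration of the task interface, the sketch below reduces VE to textual entailment by first captioning the image. The standard benchmark (SNLI-VE) uses three labels (entailment, neutral, contradiction), so the three MNLI classes map directly. The sketch assumes the Hugging Face transformers library with the public Salesforce/blip-image-captioning-base and roberta-large-mnli checkpoints; the image path and hypothesis are placeholders, and this is a deliberately weak baseline, since a caption discards most of the visual premise.

# Naive "caption-then-NLI" baseline for Visual Entailment:
# 1) caption the image to turn the visual premise into text,
# 2) run a standard textual-entailment model on (caption, hypothesis).
# Checkpoints are illustrative public models, not the methods listed below.
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def visual_entailment(image_path: str, hypothesis: str) -> str:
    """Return 'ENTAILMENT', 'NEUTRAL', or 'CONTRADICTION' for (image, hypothesis)."""
    # Premise: a generated caption standing in for the image content.
    caption = captioner(image_path)[0]["generated_text"]
    inputs = tokenizer(caption, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    return nli_model.config.id2label[logits.argmax(dim=-1).item()]

# Placeholder inputs for illustration only.
print(visual_entailment("photo.jpg", "Two dogs are playing in the snow."))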

Papers

Showing 11–20 of 56 papers

Title | Status | Hype
Good Questions Help Zero-Shot Image Reasoning | Code | 1
Lightweight In-Context Tuning for Multimodal Unified Models | - | 0
Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages | Code | 0
"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning | - | 0
I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors | Code | 1
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning | - | 0
Few-shot Multimodal Multitask Multilingual Learning | - | 0
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Code | 1
Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations | Code | 1
Compound Tokens: Channel Fusion for Vision-Language Representation Learning | - | 0

Leaderboard

No leaderboard results yet.