SOTAVerified

Visual Entailment

Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural language sentence, as it would be in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
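As a rough illustration of the task setup, the sketch below models a single VE example and scores a predictor by accuracy. The three-way label set (entailment, neutral, contradiction) follows the convention of the SNLI-VE dataset and is an assumption here, since this page does not list the labels. The names `VEExample`, `accuracy`, and `always_entailment` are hypothetical and not taken from any listed paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumed label set (SNLI-VE convention); not stated on this page.
LABELS = ("entailment", "neutral", "contradiction")


@dataclass(frozen=True)
class VEExample:
    """One image-sentence pair (hypothetical structure)."""
    image_path: str  # premise: the image
    hypothesis: str  # text to be checked against the image
    label: str       # gold label, one of LABELS


def accuracy(examples: Sequence[VEExample],
             predict: Callable[[str, str], str]) -> float:
    """Fraction of examples where predict(image, text) matches the gold label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(predict(ex.image_path, ex.hypothesis) == ex.label
                  for ex in examples)
    return correct / len(examples)


def always_entailment(image_path: str, hypothesis: str) -> str:
    """Trivial baseline that ignores its inputs."""
    return "entailment"


examples = [
    VEExample("dog.jpg", "An animal is outdoors.", "entailment"),
    VEExample("dog.jpg", "A cat is sleeping indoors.", "contradiction"),
]
score = accuracy(examples, always_entailment)
```

A real model would replace `always_entailment` with a function that actually reads the image and the sentence; the baseline is only there to make the sketch runnable.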

Papers

Showing 1-25 of 56 papers

Title | Status | Hype
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | Code | 1
Understanding Figurative Meaning through Explainable Visual Entailment | Code | 1
MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion | Code | 1
MixGen: A New Multi-Modal Data Augmentation | Code | 1
UNITER: UNiversal Image-TExt Representation Learning | Code | 1
Check It Again: Progressive Visual Question Answering via Visual Entailment | Code | 1
Check It Again: Progressive Visual Question Answering via Visual Entailment | Code | 1
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Code | 1
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Code | 1
CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks | Code | 1
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization | Code | 1
Distilled Dual-Encoder Model for Vision-Language Understanding | Code | 1
Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning | Code | 1
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation | Code | 1
Fine-Grained Visual Entailment | Code | 1
Good Questions Help Zero-Shot Image Reasoning | Code | 1
Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations | Code | 1
How Much Can CLIP Benefit Vision-and-Language Tasks? | Code | 1
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Code | 1
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Code | 1
I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors | Code | 1
Visual Spatial Reasoning | Code | 1

No leaderboard results yet.