SOTAVerified

Visual Entailment

Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural language sentence, as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
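As a minimal sketch of the task formulation: SNLI-VE, the standard VE benchmark, frames each image-hypothesis pair as a 3-way classification over entailment, neutral, and contradiction. The data structures and scoring function below are illustrative assumptions, not taken from any specific codebase on this page.

```python
from dataclasses import dataclass

# The three labels used by SNLI-VE-style Visual Entailment.
LABELS = ("entailment", "neutral", "contradiction")

@dataclass
class VEExample:
    """One image-sentence pair: the image is the premise, the text the hypothesis."""
    image_id: str    # identifier of the premise image (illustrative)
    hypothesis: str  # natural-language hypothesis to check against the image
    label: str       # gold label, one of LABELS

def predict(scores: dict) -> str:
    """Return the highest-scoring label.

    `scores` maps each label to a model-assigned score; how those scores
    are produced (CLIP, OFA, UNITER, ...) is model-specific and omitted here.
    """
    return max(LABELS, key=lambda lab: scores.get(lab, 0.0))

example = VEExample("img_0001", "Two dogs are playing in the snow.", "entailment")
print(predict({"entailment": 0.7, "neutral": 0.2, "contradiction": 0.1}))
# -> entailment
```

Any of the models listed below would plug into this interface by supplying the per-label scores for a given image-hypothesis pair.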

Papers

Showing 1-50 of 56 papers

Title | Status | Hype
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization | Code | 1
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation | Code | 1
Understanding Figurative Meaning through Explainable Visual Entailment | Code | 1
MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion | Code | 1
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Code | 1
Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning | Code | 1
Good Questions Help Zero-Shot Image Reasoning | Code | 1
I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors | Code | 1
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Code | 1
Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations | Code | 1
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Code | 1
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Code | 1
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1
MixGen: A New Multi-Modal Data Augmentation | Code | 1
CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1
Visual Spatial Reasoning | Code | 1
Fine-Grained Visual Entailment | Code | 1
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks | Code | 1
Distilled Dual-Encoder Model for Vision-Language Understanding | Code | 1
Check It Again: Progressive Visual Question Answering via Visual Entailment | Code | 1
How Much Can CLIP Benefit Vision-and-Language Tasks? | Code | 1
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | Code | 1
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1
UNITER: UNiversal Image-TExt Representation Learning | Code | 1
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks | — | 0
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing | Code | 0
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks | — | 0
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models | Code | 0
Lightweight In-Context Tuning for Multimodal Unified Models | — | 0
Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages | Code | 0
"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning | — | 0
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning | — | 0
Few-shot Multimodal Multitask Multilingual Learning | — | 0
Compound Tokens: Channel Fusion for Vision-Language Representation Learning | — | 0
A survey on knowledge-enhanced multimodal learning | — | 0
AlignVE: Visual Entailment Recognition Based on Alignment Relations | — | 0
Pre-training image-language transformers for open-vocabulary tasks | — | 0
Prompt Tuning for Generative Multimodal Pretrained Models | Code | 0
Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations | Code | 0
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering | — | 0
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks | — | 0
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment | — | 0
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment | — | 0
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | Code | 0
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks | — | 0
Logically at Factify 2022: Multimodal Fact Verification | — | 0
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation | — | 0
How Much Can CLIP Benefit Vision-and-Language Tasks? | — | 0
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training | — | 0

No leaderboard results yet.