SOTAVerified

Visual Commonsense Reasoning

Papers

Showing 1–50 of 65 papers

Title | Status | Hype
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Code | 2
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | Code | 2
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Code | 2
All in One: Exploring Unified Video-Language Pre-training | Code | 2
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning | Code | 1
MERLOT: Multimodal Neural Script Knowledge Models | Code | 1
Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning | Code | 1
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs | Code | 1
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | Code | 1
A Survey on Interpretable Cross-modal Reasoning | Code | 1
Unifying Vision-and-Language Tasks via Text Generation | Code | 1
Improving Visual Commonsense in Language Models via Multiple Image Generation | Code | 1
UNITER: UNiversal Image-TExt Representation Learning | Code | 1
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Code | 1
VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Code | 1
Towards artificial general intelligence via a multimodal foundation model | Code | 1
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics | Code | 1
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1
ILLUME: Rationalizing Vision-Language Models through Human Interactions | Code | 0
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts | Code | 0
Interpretable Visual Understanding with Cognitive Attention Network | Code | 0
Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory | Code | 0
VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers | Code | 0
VASR: Visual Analogies of Situation Recognition | Code | 0
Fusion of Detected Objects in Text for Visual Question Answering | Code | 0
Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Code | 0
Compositional Image-Text Matching and Retrieval by Grounding Entities | Code | 0
From Recognition to Cognition: Visual Commonsense Reasoning | Code | 0
Connective Cognition Network for Directional Visual Commonsense Reasoning | Code | 0
Heterogeneous Graph Learning for Visual Commonsense Reasoning | Code | 0
Joint Answering and Explanation for Visual Commonsense Reasoning | Code | 0
TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines | Code | 0
Think Visually: Question Answering through Virtual Imagery | Code | 0
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks | - | 0
A survey on knowledge-enhanced multimodal learning | - | 0
Attention Mechanism based Cognition-level Scene Understanding | - | 0
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | - | 0
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language | - | 0
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks | - | 0
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? | - | 0
Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning | - | 0
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR | - | 0
Enforcing Reasoning in Visual Commonsense Reasoning | - | 0
EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning | - | 0
Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing | - | 0
GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions | - | 0
How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | - | 0
Improving Vision-and-Language Reasoning via Spatial Relations Modeling | - | 0
InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining | - | 0
Page 1 of 2

No leaderboard results yet.