SOTAVerified

Visual Commonsense Reasoning

Papers

Showing 1-50 of 65 papers

Title | Status | Hype
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | Code | 2
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Code | 2
All in One: Exploring Unified Video-Language Pre-training | Code | 2
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Code | 2
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1
MERLOT: Multimodal Neural Script Knowledge Models | Code | 1
Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning | Code | 1
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs | Code | 1
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | Code | 1
A Survey on Interpretable Cross-modal Reasoning | Code | 1
Improving Visual Commonsense in Language Models via Multiple Image Generation | Code | 1
Unifying Vision-and-Language Tasks via Text Generation | Code | 1
UNITER: UNiversal Image-TExt Representation Learning | Code | 1
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Code | 1
VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Code | 1
Towards artificial general intelligence via a multimodal foundation model | Code | 1
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics | Code | 1
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning | Code | 1
Think Visually: Question Answering through Virtual Imagery | Code | 0
From Recognition to Cognition: Visual Commonsense Reasoning | Code | 0
Interpretable Visual Understanding with Cognitive Attention Network | Code | 0
VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers | Code | 0
Compositional Image-Text Matching and Retrieval by Grounding Entities | Code | 0
Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory | Code | 0
Connective Cognition Network for Directional Visual Commonsense Reasoning | Code | 0
Joint Answering and Explanation for Visual Commonsense Reasoning | Code | 0
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts | Code | 0
VASR: Visual Analogies of Situation Recognition | Code | 0
Heterogeneous Graph Learning for Visual Commonsense Reasoning | Code | 0
Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Code | 0
ILLUME: Rationalizing Vision-Language Models through Human Interactions | Code | 0
TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines | Code | 0
TAB-VCR: Tags and Attributes based VCR Baselines | Code | 0
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks | - | 0
A survey on knowledge-enhanced multimodal learning | - | 0
Attention Mechanism based Cognition-level Scene Understanding | - | 0
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | - | 0
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language | - | 0
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks | - | 0
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? | - | 0
Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning | - | 0
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR | - | 0
Enforcing Reasoning in Visual Commonsense Reasoning | - | 0
EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning | - | 0
Fusion of Detected Objects in Text for Visual Question Answering | - | 0
Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing | - | 0
GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions | - | 0
How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | - | 0
Improving Vision-and-Language Reasoning via Spatial Relations Modeling | - | 0
InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining | - | 0

No leaderboard results yet.