Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning Apr 7, 2021 Representation Learning Retrieval
Code Available 1 Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift Dec 15, 2022 Benchmarking Image Captioning
Code Available 1 Check It Again: Progressive Visual Question Answering via Visual Entailment Jun 8, 2021 Question Answering Visual Entailment
Code Available 1 Check It Again: Progressive Visual Question Answering via Visual Entailment Aug 1, 2021 Question Answering Visual Entailment
Code Available 1 CoCa: Contrastive Captioners are Image-Text Foundation Models May 4, 2022 Action Classification Decoder
Code Available 1 Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization Dec 19, 2024 Contrastive Learning Decision Making
Code Available 1 Distilled Dual-Encoder Model for Vision-Language Understanding Dec 16, 2021 Image to text model
Code Available 1 Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning Dec 15, 2023 Factual Inconsistency Detection in Chart Captioning Image Captioning
Code Available 1 Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment Aug 29, 2022 cross-modal alignment Image-text Retrieval
Code Available 1 Fine-Grained Visual Entailment Mar 29, 2022 Multimodal Reasoning Visual Entailment
Code Available 1 Good Questions Help Zero-Shot Image Reasoning Dec 4, 2023 Fine-Grained Image Classification Question Answering
Code Available 1 Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations Dec 8, 2022 Explanation Generation Visual Entailment
Code Available 1 How Much Can CLIP Benefit Vision-and-Language Tasks? Jul 13, 2021 Question Answering Vision and Language Navigation
Code Available 1 I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision Nov 17, 2022 Image Captioning Question Answering
Code Available 1 I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors May 24, 2023 Visual Entailment
Code Available 1 Large-Scale Adversarial Training for Vision-and-Language Representation Learning Jun 11, 2020 Image-text Retrieval Question Answering
Code Available 1 LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition Feb 15, 2024 Grounded Multimodal Named Entity Recognition Multi-modal Named Entity Recognition
Code Available 1 MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model Oct 11, 2022 Contrastive Learning Image-text matching
Code Available 1 MixGen: A New Multi-Modal Data Augmentation Jun 16, 2022 Data Augmentation Image-text Retrieval
Code Available 1 MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion Mar 14, 2024 Disentanglement Multimodal Deep Learning
Code Available 1 NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks Mar 9, 2022 Decision Making Explainable artificial intelligence
Code Available 1 Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation Jun 11, 2024 Grounded Multimodal Named Entity Recognition named-entity-recognition
Code Available 1 UNITER: UNiversal Image-TExt Representation Learning Sep 25, 2019 Image-text matching Image-text Retrieval
Code Available 1 Understanding Figurative Meaning through Explainable Visual Entailment May 2, 2024 Question Answering Visual Entailment
Code Available 1 Visual Spatial Reasoning Apr 30, 2022 Spatial Reasoning
Code Available 1 Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation Dec 10, 2021 Image-text matching Image-text Retrieval
Unverified 0 "Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning Jun 1, 2023 Image Captioning Keyword Extraction
Unverified 0 Lightweight In-Context Tuning for Multimodal Unified Models Oct 8, 2023 Image Captioning In-Context Learning
Unverified 0 UNITER: Learning UNiversal Image-TExt Representations Sep 25, 2019 Image-text matching Image-text Retrieval
Unverified 0 Logically at Factify 2022: Multimodal Fact Verification Dec 16, 2021 Benchmarking Fact Checking
Unverified 0 Playing Lottery Tickets with Vision and Language Apr 23, 2021 Image-text Retrieval Question Answering
Unverified 0 Pre-training image-language transformers for open-vocabulary tasks Sep 9, 2022 Question Answering Visual Entailment
Unverified 0 Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training Jun 25, 2021 Image-text Retrieval Question Answering
Unverified 0 Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training May 21, 2021 Question Answering Relation
Unverified 0 Few-shot Multimodal Multitask Multilingual Learning Feb 19, 2023 Few-Shot Learning In-Context Learning
Unverified 0 Segment-Phrase Table for Semantic Segmentation, Visual Entailment and Paraphrasing Sep 27, 2015 Natural Language Understanding Object Recognition
Unverified 0 How Much Can CLIP Benefit Vision-and-Language Tasks? Sep 29, 2021 Question Answering Visual Entailment
Unverified 0 Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning Mar 10, 2023 Few-Shot Image Classification image-classification
Unverified 0 Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment Mar 1, 2022 Retrieval Sentence
Unverified 0 AlignVE: Visual Entailment Recognition Based on Alignment Relations Nov 16, 2022 Question Answering Relation
Unverified 0 Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering May 2, 2022 Decoder Image Captioning
Unverified 0 ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks Feb 27, 2024 Domain Generalization Image Captioning
Unverified 0 A survey on knowledge-enhanced multimodal learning Nov 19, 2022 Conditional Image Generation Factual Visual Question Answering
Unverified 0 Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks Apr 22, 2022 Question Answering Visual Commonsense Reasoning
Unverified 0 VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks Jul 29, 2024 Deep Learning Domain Generalization
Unverified 0 CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment Mar 14, 2022 parameter-efficient fine-tuning Question Answering
Unverified 0 CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks Jan 15, 2022 Question Answering Visual Commonsense Reasoning
Unverified 0 Compound Tokens: Channel Fusion for Vision-Language Representation Learning Dec 2, 2022 Decoder Language Modeling
Unverified 0 Visual Entailment Task for Visually-Grounded Language Learning Nov 26, 2018 Grounded language learning Natural Language Inference
Code Available 0 Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages Jun 29, 2023 Image-text Retrieval Machine Translation
Code Available 0