Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning Apr 7, 2021 Representation Learning Retrieval
Code Code Available 15 Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift Dec 15, 2022 Benchmarking Image Captioning
Code Code Available 15 Check It Again: Progressive Visual Question Answering via Visual Entailment Jun 8, 2021 Question Answering Visual Entailment
Code Code Available 15 Check It Again: Progressive Visual Question Answering via Visual Entailment Aug 1, 2021 Question Answering Visual Entailment
Code Code Available 15 CoCa: Contrastive Captioners are Image-Text Foundation Models May 4, 2022 Action Classification Decoder
Code Code Available 15 Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization Dec 19, 2024 Contrastive Learning Decision Making
Code Code Available 15 Distilled Dual-Encoder Model for Vision-Language Understanding Dec 16, 2021 Image to text model
Code Code Available 15 Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning Dec 15, 2023 Factual Inconsistency Detection in Chart Captioning Image Captioning
Code Code Available 15 Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment Aug 29, 2022 cross-modal alignment Image-text Retrieval
Code Code Available 15 Fine-Grained Visual Entailment Mar 29, 2022 Multimodal Reasoning Visual Entailment
Code Code Available 15 Good Questions Help Zero-Shot Image Reasoning Dec 4, 2023 Fine-Grained Image Classification Question Answering
Code Code Available 15 Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations Dec 8, 2022 Explanation Generation Visual Entailment
Code Code Available 15 How Much Can CLIP Benefit Vision-and-Language Tasks? Jul 13, 2021 Question Answering Vision and Language Navigation
Code Code Available 15 I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision Nov 17, 2022 Image Captioning Question Answering
Code Code Available 15 I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors May 24, 2023 Visual Entailment
Code Code Available 15 Large-Scale Adversarial Training for Vision-and-Language Representation Learning Jun 11, 2020 Image-text Retrieval Question Answering
Code Code Available 15 LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition Feb 15, 2024 Grounded Multimodal Named Entity Recognition Multi-modal Named Entity Recognition
Code Code Available 15 MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model Oct 11, 2022 Contrastive Learning Image-text matching
Code Code Available 15 MixGen: A New Multi-Modal Data Augmentation Jun 16, 2022 Data Augmentation Image-text Retrieval
Code Code Available 15 MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion Mar 14, 2024 Disentanglement Multimodal Deep Learning
Code Code Available 15 NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks Mar 9, 2022 Decision Making Explainable artificial intelligence
Code Code Available 15 Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation Jun 11, 2024 Grounded Multimodal Named Entity Recognition named-entity-recognition
Code Code Available 15 UNITER: UNiversal Image-TExt Representation Learning Sep 25, 2019 Image-text matching Image-text Retrieval
Code Code Available 15 Understanding Figurative Meaning through Explainable Visual Entailment May 2, 2024 Question Answering Visual Entailment
Code Code Available 15 Visual Spatial Reasoning Apr 30, 2022 Spatial Reasoning
Code Code Available 15 VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing Mar 5, 2024 Multimodal Reasoning Sentence
Code Code Available 05 Prompt Tuning for Generative Multimodal Pretrained Models Aug 4, 2022 Image Captioning Visual Entailment
Code Code Available 05 Visual Entailment: A Novel Task for Fine-Grained Image Understanding Jan 20, 2019 Natural Language Inference Question Answering
Code Code Available 05 Visual Entailment Task for Visually-Grounded Language Learning Nov 26, 2018 Grounded language learning Natural Language Inference
Code Code Available 05 Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages Jun 29, 2023 Image-text Retrieval Machine Translation
Code Code Available 05 p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models Dec 17, 2023 Image Captioning Question Answering
Code Code Available 05 Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations Jul 23, 2022 Decision Making Explanation Generation
Code Code Available 05 OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework Feb 7, 2022 Image Captioning image-classification
Code Code Available 05 ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks Feb 27, 2024 Domain Generalization Image Captioning
— Unverified 00 A survey on knowledge-enhanced multimodal learning Nov 19, 2022 Conditional Image Generation Factual Visual Question Answering
— Unverified 00 Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks Apr 22, 2022 Question Answering Visual Commonsense Reasoning
— Unverified 00 VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks Jul 29, 2024 Deep Learning Domain Generalization
— Unverified 00 CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment Mar 14, 2022 parameter-efficient fine-tuning Question Answering
— Unverified 00 CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks Jan 15, 2022 Question Answering Visual Commonsense Reasoning
— Unverified 00 Compound Tokens: Channel Fusion for Vision-Language Representation Learning Dec 2, 2022 Decoder Language Modeling
— Unverified 00 Playing Lottery Tickets with Vision and Language Apr 23, 2021 Image-text Retrieval Question Answering
— Unverified 00 Pre-training image-language transformers for open-vocabulary tasks Sep 9, 2022 Question Answering Visual Entailment
— Unverified 00 Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training Jun 25, 2021 Image-text Retrieval Question Answering
— Unverified 00 Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training May 21, 2021 Question Answering Relation
— Unverified 00 Few-shot Multimodal Multitask Multilingual Learning Feb 19, 2023 Few-Shot Learning In-Context Learning
— Unverified 00 Segment-Phrase Table for Semantic Segmentation, Visual Entailment and Paraphrasing Sep 27, 2015 Natural Language Understanding Object Recognition
— Unverified 00 How Much Can CLIP Benefit Vision-and-Language Tasks? Sep 29, 2021 Question Answering Visual Entailment
— Unverified 00 Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning Mar 10, 2023 Few-Shot Image Classification image-classification
— Unverified 00 Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation Dec 10, 2021 Image-text matching Image-text Retrieval
— Unverified 00 "Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning Jun 1, 2023 Image Captioning Keyword Extraction
— Unverified 00