Deep Equilibrium Multimodal Fusion | Jun 29, 2023 | Visual Question Answering (VQA) | Unverified | 0
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Jun 29, 2023 | 16k Image Captioning | Code Available | 2
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering | Jun 29, 2023 | Answer Generation Question Answering | Code Available | 1
Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering | Jun 28, 2023 | Passage Retrieval Question Answering | Code Available | 0
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Jun 27, 2023 | Image Captioning Referring Expression Segmentation | Code Available | 2
Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image Captioning In-Context Learning | Code Available | 1
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Jun 26, 2023 | Hallucination Visual Question Answering | Code Available | 2
FunQA: Towards Surprising Video Comprehension | Jun 26, 2023 | Question Answering Text Generation | Code Available | 1
Visual Question Answering in Remote Sensing with Cross-Attention and Multimodal Information Bottleneck | Jun 25, 2023 | object-detection Object Detection | Unverified | 0
StarVQA+: Co-training Space-Time Attention for Video Quality Assessment | Jun 21, 2023 | Video Quality Assessment Visual Question Answering (VQA) | Code Available | 0
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Jun 16, 2023 | Image Captioning Question Answering | Code Available | 1
Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories | Jun 15, 2023 | Question Answering Retrieval | Unverified | 0
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Jun 15, 2023 | Form model | Code Available | 1
Improving Selective Visual Question Answering by Learning from Your Peers | Jun 14, 2023 | Question Answering Visual Question Answering | Code Available | 1
Scalable Neural-Probabilistic Answer Set Programming | Jun 14, 2023 | Probabilistic Programming Question Answering | Code Available | 1
Visual Question Answering (VQA) on Images with Superimposed Text | Jun 13, 2023 | Question Answering Visual Question Answering | Unverified | 0
AVIS: Autonomous Visual Information Seeking with Large Language Model Agent | Jun 13, 2023 | Decision Making Language Modeling | Unverified | 0
Weakly Supervised Visual Question Answer Generation | Jun 11, 2023 | Answer Generation Dependency Parsing | Unverified | 0
Modular Visual Question Answering via Code Generation | Jun 8, 2023 | Code Generation In-Context Learning | Code Available | 1
Knowledge Detection by Relevant Question and Image Attributes in Visual Question Answering | Jun 8, 2023 | Question Answering Retrieval | Unverified | 0
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards | Jun 7, 2023 | Diversity Image Captioning | Code Available | 1
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! | Jun 6, 2023 | counterfactual Data Augmentation | Code Available | 1
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes | Jun 4, 2023 | Common Sense Reasoning Question Answering | Unverified | 0
DocFormerv2: Local Features for Document Understanding | Jun 2, 2023 | Decoder document understanding | Code Available | 1
MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models | Jun 2, 2023 | In-Context Learning Language Modeling | Unverified | 0
Revisiting the Role of Language Priors in Vision-Language Models | Jun 2, 2023 | Image-text matching Image-text Retrieval | Code Available | 1
Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data | Jun 1, 2023 | Anomaly Detection Image Generation | Unverified | 0
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | Jun 1, 2023 | Optical Character Recognition (OCR) Question Answering | Code Available | 1
LiT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing | Jun 1, 2023 | Question Answering Visual Question Answering | Unverified | 0
End-to-end Knowledge Retrieval with Multi-modal Queries | Jun 1, 2023 | Benchmarking Cross-Modal Retrieval | Code Available | 1
Overcoming Language Bias in Remote Sensing Visual Question Answering via Adversarial Training | Jun 1, 2023 | Question Answering Visual Question Answering | Unverified | 0
Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models | May 31, 2023 | Question Answering Visual Question Answering | Unverified | 0
Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA | May 31, 2023 | counterfactual Counterfactual Inference | Unverified | 0
Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge | May 30, 2023 | Answer Selection Question Answering | Unverified | 0
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioning Audio-Visual Captioning | Code Available | 2
PaLI-X: On Scaling up a Multilingual Vision and Language Model | May 29, 2023 | Chart Question Answering document understanding | Code Available | 1
HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language | May 28, 2023 | Machine Translation Multimodal Machine Translation | Code Available | 0
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | May 27, 2023 | Image Captioning Image Retrieval | Code Available | 1
Modularized Zero-shot VQA with Pre-trained Models | May 27, 2023 | object-detection Object Detection | Code Available | 0
Study of Subjective and Objective Quality Assessment of Mobile Cloud Gaming Videos | May 26, 2023 | Video Quality Assessment Visual Question Answering (VQA) | Unverified | 0
Zero-shot Visual Question Answering with Language Model Feedback | May 26, 2023 | Language Modeling Language Modelling | Code Available | 0
Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering | May 24, 2023 | Question Answering Visual Question Answering | Unverified | 0
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario | May 24, 2023 | Autonomous Driving Question Answering | Code Available | 2
Measuring Faithful and Plausible Visual Grounding in VQA | May 24, 2023 | Question Answering Visual Grounding | Code Available | 0
Transferring Visual Attributes from Natural Language to Verified Image Generation | May 24, 2023 | Image Generation Text to Image Generation | Unverified | 0
Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach | May 23, 2023 | Image Manipulation Question Answering | Unverified | 0
DUBLIN -- Document Understanding By Language-Image Network | May 23, 2023 | Document Classification document understanding | Unverified | 0
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design | May 22, 2023 | image-classification Image Classification | Unverified | 0
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | May 22, 2023 | Question Answering Retrieval | Unverified | 0
Towards Explainable In-the-Wild Video Quality Assessment: A Database and a Language-Prompted Approach | May 22, 2023 | Video Quality Assessment Visual Question Answering (VQA) | Code Available | 1