| Explaining Autonomous Driving Actions with Visual Question Answering | Jul 19, 2023 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| A reinforcement learning approach for VQA validation: an application to diabetic macular edema grading | Jul 19, 2023 | Medical Image Analysis, Question Answering | Unverified | 0 |
| Generative Visual Question Answering | Jul 18, 2023 | Generative Visual Question Answering, Question Answering | Unverified | 0 |
| Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving | Jul 18, 2023 | Autonomous Driving, Model Selection | Code Available | 0 |
| Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation | Jul 18, 2023 | Image Generation, Question Answering | Unverified | 0 |
| PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese | Jul 17, 2023 | Question Answering, Vietnamese Visual Question Answering | Unverified | 0 |
| A scoping review on multimodal deep learning in biomedical images and texts | Jul 14, 2023 | Cross-Modal Retrieval, Decision Making | Unverified | 0 |
| MMBench: Is Your Multi-modal Model an All-around Player? | Jul 12, 2023 | All, Instruction Following | Code Available | 5 |
| Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting | Jul 11, 2023 | Medical Visual Question Answering, Question Answering | Code Available | 1 |
| CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery | Jul 11, 2023 | Question Answering, Scene Understanding | Code Available | 1 |
| Emu: Generative Pretraining in Multimodality | Jul 11, 2023 | Image Captioning, Image Generation | Code Available | 3 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | Jul 9, 2023 | Question Answering, TGIF-Frame | Code Available | 1 |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | Jul 7, 2023 | Attribute, Common Sense Reasoning | Code Available | 2 |
| Structure Guided Multi-modal Pre-trained Transformer for Knowledge Graph Reasoning | Jul 6, 2023 | Knowledge Graphs, Question Answering | Unverified | 0 |
| UIT-Saviors at MEDVQA-GI 2023: Improving Multimodal Learning with Image Enhancement for Gastrointestinal Visual Question Answering | Jul 6, 2023 | Diagnostic, Image Enhancement | Unverified | 0 |
| JourneyDB: A Benchmark for Generative Image Understanding | Jul 3, 2023 | Image Captioning, Image Comprehension | Code Available | 2 |
| Localized Questions in Medical Visual Question Answering | Jul 3, 2023 | Medical Visual Question Answering, Question Answering | Code Available | 1 |
| Multimodal Prompt Retrieval for Generative Visual Question Answering | Jun 30, 2023 | Domain Adaptation, Generative Visual Question Answering | Code Available | 1 |
| Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering | Jun 29, 2023 | Answer Generation, Question Answering | Code Available | 1 |
| Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering | Jun 28, 2023 | Passage Retrieval, Question Answering | Code Available | 0 |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Jun 27, 2023 | Image Captioning, Referring Expression Segmentation | Code Available | 2 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image Captioning, In-Context Learning | Code Available | 1 |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Jun 26, 2023 | Hallucination, Visual Question Answering | Code Available | 2 |
| Visual Question Answering in Remote Sensing with Cross-Attention and Multimodal Information Bottleneck | Jun 25, 2023 | object-detection, Object Detection | Unverified | 0 |
| Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input | Jun 25, 2023 | Diversity, Image-text Retrieval | Unverified | 0 |
| TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter | Jun 22, 2023 | Question Answering, Retrieval | Code Available | 0 |
| Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Jun 16, 2023 | Image Captioning, Question Answering | Code Available | 1 |
| Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories | Jun 15, 2023 | Question Answering, Retrieval | Unverified | 0 |
| LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Jun 15, 2023 | Hallucination, Image Captioning | Code Available | 2 |
| Improving Selective Visual Question Answering by Learning from Your Peers | Jun 14, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| Scalable Neural-Probabilistic Answer Set Programming | Jun 14, 2023 | Probabilistic Programming, Question Answering | Code Available | 1 |
| Visual Question Answering (VQA) on Images with Superimposed Text | Jun 13, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Safeguarding Data in Multimodal AI: A Differentially Private Approach to CLIP Training | Jun 13, 2023 | image-classification, Image Classification | Code Available | 0 |
| AVIS: Autonomous Visual Information Seeking with Large Language Model Agent | Jun 13, 2023 | Decision Making, Language Modeling | Unverified | 0 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation | Jun 12, 2023 | Image Captioning, Machine Translation | Unverified | 0 |
| Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark | Jun 10, 2023 | Image-text Retrieval, Medical Report Generation | Code Available | 1 |
| Knowledge Detection by Relevant Question and Image Attributes in Visual Question Answering | Jun 8, 2023 | Question Answering, Retrieval | Unverified | 0 |
| Modular Visual Question Answering via Code Generation | Jun 8, 2023 | Code Generation, In-Context Learning | Code Available | 1 |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Jun 8, 2023 | In-Context Learning, Visual Question Answering | Code Available | 4 |
| Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! | Jun 6, 2023 | counterfactual, Data Augmentation | Code Available | 1 |
| Diversifying Joint Vision-Language Tokenization Learning | Jun 6, 2023 | Question Answering, Representation Learning | Unverified | 0 |
| An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge | Jun 6, 2023 | ARC, Question Answering | Code Available | 1 |
| Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes | Jun 4, 2023 | Common Sense Reasoning, Question Answering | Unverified | 0 |
| Revisiting the Role of Language Priors in Vision-Language Models | Jun 2, 2023 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Jun 1, 2023 | Image Classification, Instruction Following | Code Available | 4 |
| Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data | Jun 1, 2023 | Anomaly Detection, Image Generation | Unverified | 0 |
| Overcoming Language Bias in Remote Sensing Visual Question Answering via Adversarial Training | Jun 1, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| LiT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing | Jun 1, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models | May 31, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |