| Separate and Locate: Rethink the Text in Text-based Visual Question Answering | Aug 31, 2023 | Optical Character Recognition (OCR), Position | Code Available | 0 |
| Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception | Aug 31, 2023 | Activity Recognition, Human Activity Recognition | Unverified | 0 |
| DLIP: Distilling Language-Image Pre-training | Aug 24, 2023 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE | Aug 23, 2023 | Image-text matching, Image-text Retrieval | Unverified | 0 |
| SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes | Aug 21, 2023 | Attribute, Question Answering | Unverified | 0 |
| VQA Therapy: Exploring Answer Differences by Visually Grounding Answers | Aug 21, 2023 | Question Answering, Visual Question Answering | Code Available | 0 |
| Generic Attention-model Explainability by Weighted Relevance Accumulation | Aug 20, 2023 | Image Captioning, Question Answering | Unverified | 0 |
| Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models | Aug 18, 2023 | Image-text matching, Object Localization | Unverified | 0 |
| Learning the meanings of function words from grounded language using a visual question answering model | Aug 16, 2023 | Logical Reasoning, Question Answering | Code Available | 0 |
| TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models | Aug 7, 2023 | backdoor defense, object-detection | Code Available | 0 |
| RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic | Aug 3, 2023 | Chart Question Answering, Formal Logic | Code Available | 0 |
| ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders | Aug 2, 2023 | Contrastive Learning, Question Answering | Unverified | 0 |
| Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering | Jul 28, 2023 | Question Answering, Visual Question Answering | Code Available | 0 |
| BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers Models for Vietnamese Visual Question Answering | Jul 28, 2023 | Question Answering, Vietnamese Visual Question Answering | Unverified | 0 |
| LOIS: Looking Out of Instance Semantics for Visual Question Answering | Jul 26, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Robust Visual Question Answering: Datasets, Methods, and Future Challenges | Jul 21, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| A reinforcement learning approach for VQA validation: an application to diabetic macular edema grading | Jul 19, 2023 | Medical Image Analysis, Question Answering | Unverified | 0 |
| Generative Visual Question Answering | Jul 18, 2023 | Generative Visual Question Answering, Question Answering | Unverified | 0 |
| Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation | Jul 18, 2023 | Image Generation, Question Answering | Unverified | 0 |
| Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving | Jul 18, 2023 | Autonomous Driving, Model Selection | Code Available | 0 |
| PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese | Jul 17, 2023 | Question Answering, Vietnamese Visual Question Answering | Unverified | 0 |
| A scoping review on multimodal deep learning in biomedical images and texts | Jul 14, 2023 | Cross-Modal Retrieval, Decision Making | Unverified | 0 |
| Structure Guided Multi-modal Pre-trained Transformer for Knowledge Graph Reasoning | Jul 6, 2023 | Knowledge Graphs, Question Answering | Unverified | 0 |
| UIT-Saviors at MEDVQA-GI 2023: Improving Multimodal Learning with Image Enhancement for Gastrointestinal Visual Question Answering | Jul 6, 2023 | Diagnostic, Image Enhancement | Unverified | 0 |
| Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering | Jun 28, 2023 | Passage Retrieval, Question Answering | Code Available | 0 |
| Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input | Jun 25, 2023 | Diversity, Image-text Retrieval | Unverified | 0 |
| Visual Question Answering in Remote Sensing with Cross-Attention and Multimodal Information Bottleneck | Jun 25, 2023 | object-detection, Object Detection | Unverified | 0 |
| TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter | Jun 22, 2023 | Question Answering, Retrieval | Code Available | 0 |
| Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories | Jun 15, 2023 | Question Answering, Retrieval | Unverified | 0 |
| AVIS: Autonomous Visual Information Seeking with Large Language Model Agent | Jun 13, 2023 | Decision Making, Language Modeling | Unverified | 0 |
| Safeguarding Data in Multimodal AI: A Differentially Private Approach to CLIP Training | Jun 13, 2023 | image-classification, Image Classification | Code Available | 0 |
| Visual Question Answering (VQA) on Images with Superimposed Text | Jun 13, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation | Jun 12, 2023 | Image Captioning, Machine Translation | Unverified | 0 |
| Knowledge Detection by Relevant Question and Image Attributes in Visual Question Answering | Jun 8, 2023 | Question Answering, Retrieval | Unverified | 0 |
| Diversifying Joint Vision-Language Tokenization Learning | Jun 6, 2023 | Question Answering, Representation Learning | Unverified | 0 |
| Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes | Jun 4, 2023 | Common Sense Reasoning, Question Answering | Unverified | 0 |
| LiT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing | Jun 1, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data | Jun 1, 2023 | Anomaly Detection, Image Generation | Unverified | 0 |
| Overcoming Language Bias in Remote Sensing Visual Question Answering via Adversarial Training | Jun 1, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA | May 31, 2023 | counterfactual, Counterfactual Inference | Unverified | 0 |
| Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models | May 31, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge | May 30, 2023 | Answer Selection, Question Answering | Unverified | 0 |
| HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language | May 28, 2023 | Machine Translation, Multimodal Machine Translation | Code Available | 0 |
| Modularized Zero-shot VQA with Pre-trained Models | May 27, 2023 | object-detection, Object Detection | Code Available | 0 |
| Zero-shot Visual Question Answering with Language Model Feedback | May 26, 2023 | Language Modeling, Language Modelling | Code Available | 0 |
| Mindstorms in Natural Language-Based Societies of Mind | May 26, 2023 | 3D Generation, Image Captioning | Unverified | 0 |
| GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions | May 24, 2023 | Object, Question Answering | Unverified | 0 |
| Measuring Faithful and Plausible Visual Grounding in VQA | May 24, 2023 | Question Answering, Visual Grounding | Code Available | 0 |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | May 24, 2023 | Image Captioning, Language Modelling | Unverified | 0 |
| Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering | May 24, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |