| HAMMR: HierArchical MultiModal React agents for generic VQA | Apr 8, 2024 | Optical Character Recognition (OCR), Question Answering | Unverified | 0 |
| Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement | Apr 6, 2024 | Image-text Retrieval, object-detection | Unverified | 0 |
| Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning | Apr 6, 2024 | Domain Generalization, Image Retrieval | Code Available | 0 |
| Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models | Apr 6, 2024 | MME, Object | Code Available | 0 |
| BuDDIE: A Business Document Dataset for Multi-task Information Extraction | Apr 5, 2024 | Document Classification, document understanding | Unverified | 0 |
| TinyVQA: Compact Multimodal Deep Neural Network for Visual Question Answering on Resource-Constrained Devices | Apr 4, 2024 | Quantization, Question Answering | Unverified | 0 |
| Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns | Apr 3, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs | Apr 1, 2024 | Common Sense Reasoning, Object | Unverified | 0 |
| Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning | Apr 1, 2024 | Image Captioning, Instruction Following | Code Available | 0 |
| Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training | Mar 30, 2024 | Contrastive Learning, Question Answering | Code Available | 0 |
| Uncovering Bias in Large Vision-Language Models with Counterfactuals | Mar 29, 2024 | counterfactual, Question Answering | Unverified | 0 |
| A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions | Mar 26, 2024 | Gaze Target Estimation, Question Answering | Unverified | 0 |
| Visual Hallucination: Definition, Quantification, and Prescriptive Remediations | Mar 26, 2024 | Hallucination, Image Captioning | Unverified | 0 |
| Intrinsic Subgraph Generation for Interpretable Graph based Visual Question Answering | Mar 26, 2024 | Decision Making, Explainable artificial intelligence | Code Available | 0 |
| Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA | Mar 25, 2024 | Chart Question Answering, Data Augmentation | Unverified | 0 |
| PropTest: Automatic Property Testing for Improved Visual Programming | Mar 25, 2024 | Question Answering, Referring Expression | Unverified | 0 |
| Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery | Mar 22, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| MyVLM: Personalizing VLMs for User-Specific Queries | Mar 21, 2024 | Image Captioning, Language Modelling | Unverified | 0 |
| VL-Mamba: Exploring State Space Models for Multimodal Learning | Mar 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | Mar 20, 2024 | Audio captioning, Image Captioning | Unverified | 0 |
| As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks? | Mar 19, 2024 | Adversarial Attack, Image Captioning | Unverified | 0 |
| WoLF: Wide-scope Large Language Model Framework for CXR Understanding | Mar 19, 2024 | Anatomy, Instruction Following | Unverified | 0 |
| FlexCap: Describe Anything in Images in Controllable Detail | Mar 18, 2024 | Attribute, Dense Captioning | Unverified | 0 |
| Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis | Mar 18, 2024 | In-Context Learning, Question Answering | Unverified | 0 |
| SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors | Mar 18, 2024 | Hallucination, Motion Planning | Unverified | 0 |