AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation | Mar 20, 2024 | Image Generation, Text to Image Generation | Unverified 0
WoLF: Wide-scope Large Language Model Framework for CXR Understanding | Mar 19, 2024 | Anatomy, Instruction Following | Unverified 0
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning | Mar 19, 2024 | Benchmarking, Image Captioning | Code Available 2
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Mar 19, 2024 | Reinforcement Learning (RL), Visual Grounding | Code Available 1
SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors | Mar 18, 2024 | Hallucination, Motion Planning | Unverified 0
FlexCap: Describe Anything in Images in Controllable Detail | Mar 18, 2024 | Attribute, Dense Captioning | Unverified 0
PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset | Mar 17, 2024 | Attribute, Common Sense Reasoning | Code Available 1
Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches | Mar 17, 2024 | Image Captioning, Question Answering | Unverified 0
Knowledge Condensation and Reasoning for Knowledge-based VQA | Mar 15, 2024 | Question Answering, Reading Comprehension | Unverified 0
Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning | Mar 15, 2024 | Hallucination, Instruction Following | Unverified 0
Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | Mar 15, 2024 | Few-Shot Image Classification, image-classification | Unverified 0
UniCode: Learning a Unified Codebook for Multimodal Large Language Models | Mar 14, 2024 | Quantization, Visual Question Answering (VQA) | Unverified 0
Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering | Mar 14, 2024 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available 0
Multi-modal Auto-regressive Modeling via Visual Words | Mar 12, 2024 | Visual Question Answering, Visual Question Answering (VQA) | Code Available 1
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models | Mar 12, 2024 | Concept Alignment, Instruction Following | Code Available 1
OmniCount: Multi-label Object Counting with Semantic-Geometric Priors | Mar 8, 2024 | Object, Object Counting | Unverified 0
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | Mar 7, 2024 | document understanding, Key Information Extraction | Code Available 5
SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM | Mar 7, 2024 | Question Answering, Retrieval | Unverified 0
CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable Environments | Mar 5, 2024 | Language Modelling, Large Language Model | Unverified 0
Enhancing Generalization in Medical Visual Question Answering Tasks via Gradient-Guided Model Perturbation | Mar 5, 2024 | Data Augmentation, Medical Visual Question Answering | Unverified 0
Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review | Mar 4, 2024 | Medical Report Generation, Question Answering | Code Available 3
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks | Feb 27, 2024 | Domain Generalization, Image Captioning | Unverified 0
LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery | Feb 26, 2024 | Continual Learning, Exemplar-Free | Code Available 0
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA | Feb 24, 2024 | 3D Question Answering (3D-QA), Question Answering | Code Available 1
CommVQA: Situating Visual Question Answering in Communicative Contexts | Feb 22, 2024 | Question Answering, Visual Question Answering | Code Available 0
Uncertainty-Aware Evaluation for Vision-Language Models | Feb 22, 2024 | Conformal Prediction, Language Modeling | Code Available 1
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment | Feb 21, 2024 | Language Modelling, Question Answering | Code Available 1
CoLLaVO: Crayon Large Language and Vision mOdel | Feb 17, 2024 | Large Language Model, model | Code Available 2
A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models | Feb 17, 2024 | Diagnostic, Visual Question Answering (VQA) | Unverified 0
II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering | Feb 16, 2024 | Question Answering, Triplet | Code Available 0
VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models | Feb 16, 2024 | Adversarial Robustness, Language Modelling | Unverified 0
Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models | Feb 16, 2024 | Diversity, Instruction Following | Code Available 1
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter | Feb 16, 2024 | Language Modeling, Language Modelling | Unverified 0
LAPDoc: Layout-Aware Prompting for Documents | Feb 15, 2024 | document understanding, Key Information Extraction | Unverified 0
Prompt-based Personalized Federated Learning for Medical Visual Question Answering | Feb 15, 2024 | Federated Learning, Medical Visual Question Answering | Unverified 0
Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays | Feb 14, 2024 | Language Modeling, Language Modelling | Code Available 0
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM | Feb 14, 2024 | Medical Visual Question Answering, Question Answering | Code Available 4
Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks | Feb 13, 2024 | Language Modeling, Language Modelling | Unverified 0
PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers | Feb 13, 2024 | Question Answering, Retrieval | Code Available 3
KVQ: Kwai Video Quality Assessment for Short-form Videos | Feb 11, 2024 | Form, Video Quality Assessment | Code Available 2
Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy | Feb 11, 2024 | Language Modeling, Open Vocabulary Attribute Detection | Code Available 1
Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations | Feb 10, 2024 | Diagnostic, Hallucination | Code Available 1
CIC: A Framework for Culturally-Aware Image Captioning | Feb 8, 2024 | Descriptive, Image Captioning | Unverified 0
ScreenAI: A Vision-Language Model for UI and Infographics Understanding | Feb 7, 2024 | Chart Question Answering, Language Modeling | Code Available 2
Convincing Rationales for Visual Question Answering Reasoning | Feb 6, 2024 | Question Answering, Visual Question Answering | Code Available 0
Curriculum reinforcement learning for quantum architecture search under hardware errors | Feb 5, 2024 | 3D Architecture, Computational Efficiency | Unverified 0
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | Feb 5, 2024 | Science Question Answering, Text-to-Video Generation | Code Available 4
Text-Guided Image Clustering | Feb 5, 2024 | Clustering, Image Captioning | Code Available 1
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Feb 4, 2024 | Language Modeling, Language Modelling | Code Available 2
Knowledge Generation for Zero-shot Knowledge-based VQA | Feb 4, 2024 | Question Answering, Visual Question Answering | Code Available 0