A Simple Aerial Detection Baseline of Multimodal Language Models Jan 16, 2025 object-detection Object Detection
Dual Diffusion for Unified Image Generation and Understanding Dec 31, 2024 Image Generation Language Modeling
Online Video Understanding: OVBench and VideoChat-Online Dec 31, 2024 Autonomous Driving Question Answering
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models Dec 30, 2024 Question Answering Token Reduction
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models Dec 24, 2024 Question Answering Video Question Answering
Evaluating LLM Reasoning in the Operations Research Domain with ORQA Dec 22, 2024 Question Answering
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving Dec 19, 2024 Autonomous Driving Benchmarking
SimGRAG: Leveraging Similar Subgraphs for Knowledge Graphs Driven Retrieval-Augmented Generation Dec 17, 2024 Fact Verification Knowledge Graphs
Neptune: The Long Orbit to Benchmarking Long Video Understanding Dec 12, 2024 Benchmarking Multimodal Reasoning
Doe-1: Closed-Loop Autonomous Driving with Large World Model Dec 12, 2024 Autonomous Driving Decision Making
Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine Dec 12, 2024 Language Modeling Language Modelling
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities Dec 10, 2024 Medical Visual Question Answering Question Answering
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action Dec 7, 2024 Depth Estimation Mathematical Reasoning
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences Dec 2, 2024 Embodied Question Answering Question Answering
Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering Nov 26, 2024 Prognosis Question Answering
Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment Nov 26, 2024 Image Quality Assessment Question Answering
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering Nov 25, 2024 Question Answering Visual Question Answering
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Nov 22, 2024 Question Answering Video Question Answering
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI Nov 21, 2024 Decision Making Language Modeling
MC-LLaVA: Multi-Concept Personalized Vision-Language Model Nov 18, 2024 Language Modeling Language Modelling
End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering Nov 8, 2024 Language Modeling Language Modelling
VQA^2: Visual Question Answering for Video Quality Assessment Nov 6, 2024 Question Answering Video Quality Assessment
Multi-Agent Large Language Models for Conversational Task-Solving Oct 30, 2024 Fairness Question Answering
LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering Oct 23, 2024 Chunking Question Answering
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering Oct 21, 2024 Open-Domain Question Answering Question Answering
RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models Oct 17, 2024 Image Captioning Question Answering
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI Oct 15, 2024 Question Answering Video Question Answering
SensorLLM: Human-Intuitive Alignment of Multivariate Sensor Data with LLMs for Activity Recognition Oct 14, 2024 Activity Recognition Descriptive
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs Oct 14, 2024 Computational Efficiency Question Answering
Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation Oct 11, 2024 Open-Domain Question Answering Question Answering
VoxelPrompt: A Vision-Language Agent for Grounded Medical Image Analysis Oct 10, 2024 Medical Image Analysis Question Answering
Large Continual Instruction Assistant Oct 8, 2024 Question Answering Semantic Similarity
TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data Oct 8, 2024 Change Detection Earth Observation
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling Oct 8, 2024 document understanding Language Modeling
Differential Transformer Oct 7, 2024 Hallucination In-Context Learning
QAEncoder: Towards Aligned Representation Learning in Question Answering System Sep 30, 2024 Document Embedding Question Answering
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding Sep 26, 2024 Question Answering Video Understanding
Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning Sep 19, 2024 Language Modeling Language Modelling
TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning Sep 18, 2024 Fact Verification Question Answering
One missing piece in Vision and Language: A Survey on Comics Understanding Sep 14, 2024 document understanding image-classification
EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis Sep 10, 2024 Contrastive Learning Cross-Modal Retrieval
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding Sep 5, 2024 Question Answering Scene Understanding
Towards Evaluating and Building Versatile Large Language Models for Medicine Aug 22, 2024 Multiple-choice named-entity-recognition
PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding Aug 18, 2024 Language Modelling Question Answering
A Survey on Benchmarks of Multimodal Large Language Models Aug 16, 2024 Question Answering Survey
ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area Aug 14, 2024 Language Modeling Language Modelling
EfficientRAG: Efficient Retriever for Multi-Hop Question Answering Aug 8, 2024 Multi-hop Question Answering Question Answering
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI Aug 6, 2024 Question Answering Visual Question Answering
500xCompressor: Generalized Prompt Compression for Large Language Models Aug 6, 2024 Language Modelling Large Language Model
XMainframe: A Large Language Model for Mainframe Modernization Aug 5, 2024 Code Summarization Language Modeling