GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities Jun 17, 2024 Audio Question Answering Instruction Following
GOFA: A Generative One-For-All Model for Joint Graph Language Modeling Jul 12, 2024 All Language Modeling
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction Mar 27, 2024 Image Captioning Language Modeling
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM Mar 27, 2024 Language Modeling Language Modelling
MuGER^2: Multi-Granularity Evidence Retrieval and Reasoning for Hybrid Question Answering Oct 19, 2022 Navigate Question Answering
Multi-Agent Large Language Models for Conversational Task-Solving Oct 30, 2024 Fairness Question Answering
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models Oct 13, 2023 Hallucination Image Captioning
Cross-Task Generalization via Natural Language Crowdsourcing Instructions Apr 18, 2021 Question Answering
ConceptLab: Creative Concept Generation using VLM-Guided Diffusion Prior Constraints Aug 3, 2023 Image Generation Language Modelling
Neptune: The Long Orbit to Benchmarking Long Video Understanding Dec 12, 2024 Benchmarking Multimodal Reasoning
From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks Jun 4, 2024 Image Captioning Language Modelling
Can AI Assistants Know What They Don't Know? Jan 24, 2024 Math Open-Domain Question Answering
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models Dec 30, 2024 Question Answering Token Reduction
FreeVA: Offline MLLM as Training-Free Video Assistant May 13, 2024 Fairness Question Answering
F-LMM: Grounding Frozen Large Multimodal Models Jun 9, 2024 General Knowledge Instruction Following
CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning Jun 7, 2024 Instruction Following Math
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning Apr 1, 2025 Audio-visual Question Answering Audio-Visual Question Answering (AVQA)
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs Oct 14, 2024 Computational Efficiency Question Answering
PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding Aug 18, 2024 Language Modelling Question Answering
PaLM-E: An Embodied Multimodal Language Model Mar 6, 2023 Language Modeling Language Modelling
Frozen Transformers in Language Models Are Effective Visual Encoder Layers Oct 19, 2023 Action Recognition Image-text Retrieval
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence Feb 17, 2024 Benchmarking Form
An Embodied Generalist Agent in 3D World Nov 18, 2023 3D dense captioning 3D Question Answering (3D-QA)
FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design Nov 23, 2023 Decision Making Language Modelling
PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging Jan 5, 2024 Medical Report Generation Medical Visual Question Answering
Pengi: An Audio Language Model for Audio Tasks May 19, 2023 Audio captioning Audio Question Answering
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering Sep 29, 2023 Image to text Passage Retrieval
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation May 22, 2024 Informativeness Language Modeling
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models Nov 22, 2023 Benchmarking Phrase Grounding
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training Jun 2, 2023 Language Modeling Language Modelling
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation Jun 10, 2025 Image-text Retrieval Question Answering
Atlas: Few-shot Learning with Retrieval Augmented Language Models Aug 5, 2022 Fact Checking Few-Shot Learning
FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models Feb 21, 2024 Question Answering
EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis Sep 10, 2024 Contrastive Learning Cross-Modal Retrieval
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer Oct 23, 2019 Answer Generation Common Sense Reasoning
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator Feb 15, 2024 Benchmarking Diagnostic
FakeBench: Probing Explainable Fake Image Detection via Large Multimodal Models Apr 20, 2024 Binary Classification Fake Image Detection
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving Dec 19, 2024 Autonomous Driving Benchmarking
FinBERT-QA: Financial Question Answering with pre-trained BERT Language Models Apr 24, 2025 Answer Selection Information Retrieval
Evaluating LLM Reasoning in the Operations Research Domain with ORQA Dec 22, 2024 Question Answering
Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework Jun 20, 2024 Hallucination Question Answering
QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization Mar 11, 2022 image-classification Image Classification
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding Sep 26, 2024 Question Answering Video Understanding
Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models Jan 25, 2025 Attribute Contrastive Learning
ERA-CoT: Improving Chain-of-Thought through Entity Relationship Analysis Mar 11, 2024 Question Answering
Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling Jun 18, 2024 Arithmetic Reasoning Language Modeling
End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering Nov 8, 2024 Language Modeling Language Modelling
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement May 24, 2024 Hallucination Image Comprehension
Explore the Limits of Omni-modal Pretraining at Scale Jun 13, 2024 Language Modeling Language Modelling
Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers Mar 22, 2024 Information Retrieval