GIT: A Generative Image-to-text Transformer for Vision and Language | May 27, 2022 | Decoder, Image Captioning
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI | Aug 6, 2024 | Question Answering, Visual Question Answering
GeoChat: Grounded Large Vision-Language Model for Remote Sensing | Nov 24, 2023 | Instruction Following, Language Modeling
Generate-on-Graph: Treat LLM as both Agent and KG in Incomplete Knowledge Graph Question Answering | Apr 23, 2024 | Graph Question Answering, Hallucination
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Feb 4, 2024 | Language Modeling, Language Modelling
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Mar 7, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA)
GOFA: A Generative One-For-All Model for Joint Graph Language Modeling | Jul 12, 2024 | All, Language Modeling
HMT: Hierarchical Memory Transformer for Long Context Language Processing | May 9, 2024 | Language Modeling, Language Modelling
Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Mar 6, 2024 | Multimodal Reasoning, Question Answering
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models | Oct 13, 2023 | Hallucination, Image Captioning
Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | Jun 18, 2024 | Arithmetic Reasoning, Language Modeling
Learning to Filter Context for Retrieval-Augmented Generation | Nov 14, 2023 | Extractive Question-Answering, Fact Verification
Learnware of Language Models: Specialized Small Language Models Can Do Big | May 19, 2025 | Privacy Preserving, Question Answering
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA | Jun 25, 2024 | Benchmarking, Long-Context Understanding
A Replication Study of Dense Passage Retriever | Apr 12, 2021 | Open-Domain Question Answering, Question Answering
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | Sep 5, 2024 | Question Answering, Scene Understanding
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | Mar 27, 2024 | Language Modeling, Language Modelling
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs | Oct 14, 2024 | Computational Efficiency, Question Answering
From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks | Jun 4, 2024 | Image Captioning, Language Modelling
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning | Apr 1, 2025 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA)
F-LMM: Grounding Frozen Large Multimodal Models | Jun 9, 2024 | General Knowledge, Instruction Following
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models | Dec 30, 2024 | Question Answering, Token Reduction
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation | Jun 10, 2025 | Image-text Retrieval, Question Answering
FreeVA: Offline MLLM as Training-Free Video Assistant | May 13, 2024 | Fairness, Question Answering
Frozen Transformers in Language Models Are Effective Visual Encoder Layers | Oct 19, 2023 | Action Recognition, Image-text Retrieval
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | Mar 12, 2023 | Image Captioning, Question Answering
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs | Jun 27, 2025 | Question Answering, Video Question Answering
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild | Jul 4, 2024 | Chart Understanding, Decision Making
Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers | Mar 22, 2024 | Information Retrieval
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers | Dec 13, 2023 | 3D Question Answering (3D-QA), Attribute
LLoCO: Learning Long Contexts Offline | Apr 11, 2024 | 4k, In-Context Learning
A Simple Aerial Detection Baseline of Multimodal Language Models | Jan 16, 2025 | object-detection, Object Detection
An Embodied Generalist Agent in 3D World | Nov 18, 2023 | 3D dense captioning, 3D Question Answering (3D-QA)
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | Sep 29, 2023 | Image to text, Passage Retrieval
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding | Jul 22, 2024 | Multiple-choice, Question Answering
LongVLM: Efficient Long Video Understanding via Large Language Models | Apr 4, 2024 | Question Answering, Video Question Answering
LOVA3: Learning to Visual Question Answering, Asking and Assessment | May 23, 2024 | Question Answering, Visual Question Answering
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences | Dec 2, 2024 | Embodied Question Answering, Question Answering
FinBERT-QA: Financial Question Answering with pre-trained BERT Language Models | Apr 24, 2025 | Answer Selection, Information Retrieval
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Jun 15, 2023 | Hallucination, Image Captioning
BlendSQL: A Scalable Dialect for Unifying Hybrid Question Answering in Relational Algebra | Feb 27, 2024 | Question Answering
MASKSEARCH: A Universal Pre-Training Framework to Enhance Agentic Search Capability | May 26, 2025 | Multi-hop Question Answering, Question Answering
CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models | Jun 14, 2024 | Multiple-choice, Question Answering
A Survey on Benchmarks of Multimodal Large Language Models | Aug 16, 2024 | Question Answering, Survey
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training | Jun 2, 2023 | Language Modeling, Language Modelling
MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Jan 1, 2021 | Phrase Grounding, Question Answering
Measuring and Narrowing the Compositionality Gap in Language Models | Oct 7, 2022 | Question Answering
FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models | Feb 21, 2024 | Question Answering
Atlas: Few-shot Learning with Retrieval Augmented Language Models | Aug 5, 2022 | Fact Checking, Few-Shot Learning
BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains | Feb 15, 2024 | Few-Shot Learning, Medical Question Answering