SOTAVerified

Question Answering

Question answering encompasses domain-specific variants such as community question answering and knowledge-base question answering. Popular benchmark datasets for evaluating question answering systems include SQuAD, HotpotQA, bAbI, TriviaQA, and WikiQA, among others. Models are typically evaluated with exact match (EM) and F1 scores. Recent top-performing models include T5 and XLNet.

(Image credit: SQuAD)
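EM rewards only exact string matches after light normalization, while F1 gives partial credit at the token level. Below is a minimal Python sketch of both metrics, following the normalization conventions of the official SQuAD evaluation script (lowercasing, stripping punctuation and articles, collapsing whitespace); the function names are illustrative, not part of any library.

```python
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace,
    mirroring the normalization used by the official SQuAD eval script."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, ground_truth: str) -> float:
    """EM is 1.0 iff the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A partially correct span gets EM = 0 but F1 > 0.
print(exact_match("the Eiffel Tower", "Eiffel Tower"))   # 1.0 (article removed)
print(f1_score("Eiffel Tower in Paris", "Eiffel Tower"))  # ~0.67
```

When a question has several reference answers, each metric is conventionally computed against every reference and the maximum score is kept.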

Papers

Showing 401–450 of 10,817 papers

Title | Status | Hype
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | Code | 2
GOFA: A Generative One-For-All Model for Joint Graph Language Modeling | Code | 2
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction | Code | 2
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | Code | 2
MuGER^2: Multi-Granularity Evidence Retrieval and Reasoning for Hybrid Question Answering | Code | 2
Multi-Agent Large Language Models for Conversational Task-Solving | Code | 2
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models | Code | 2
Cross-Task Generalization via Natural Language Crowdsourcing Instructions | Code | 2
ConceptLab: Creative Concept Generation using VLM-Guided Diffusion Prior Constraints | Code | 2
Neptune: The Long Orbit to Benchmarking Long Video Understanding | Code | 2
From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks | Code | 2
Can AI Assistants Know What They Don't Know? | Code | 2
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models | Code | 2
FreeVA: Offline MLLM as Training-Free Video Assistant | Code | 2
F-LMM: Grounding Frozen Large Multimodal Models | Code | 2
CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning | Code | 2
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning | Code | 2
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs | Code | 2
PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding | Code | 2
PaLM-E: An Embodied Multimodal Language Model | Code | 2
Frozen Transformers in Language Models Are Effective Visual Encoder Layers | Code | 2
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence | Code | 2
An Embodied Generalist Agent in 3D World | Code | 2
FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design | Code | 2
PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging | Code | 2
Pengi: An Audio Language Model for Audio Tasks | Code | 2
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | Code | 2
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation | Code | 2
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Code | 2
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training | Code | 2
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation | Code | 2
Atlas: Few-shot Learning with Retrieval Augmented Language Models | Code | 2
FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models | Code | 2
EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis | Code | 2
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Code | 2
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator | Code | 2
FakeBench: Probing Explainable Fake Image Detection via Large Multimodal Models | Code | 2
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Code | 2
FinBERT-QA: Financial Question Answering with pre-trained BERT Language Models | Code | 2
Evaluating LLM Reasoning in the Operations Research Domain with ORQA | Code | 2
Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework | Code | 2
QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization | Code | 2
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding | Code | 2
Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models | Code | 2
ERA-CoT: Improving Chain-of-Thought through Entity Relationship Analysis | Code | 2
Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | Code | 2
End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering | Code | 2
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | Code | 2
Explore the Limits of Omni-modal Pretraining at Scale | Code | 2
Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | IE-Net (ensemble) | EM | 90.94 | – | Unverified
2 | FPNet (ensemble) | EM | 90.87 | – | Unverified
3 | IE-NetV2 (ensemble) | EM | 90.86 | – | Unverified
4 | SA-Net on Albert (ensemble) | EM | 90.72 | – | Unverified
5 | SA-Net-V2 (ensemble) | EM | 90.68 | – | Unverified
6 | FPNet (ensemble) | EM | 90.6 | – | Unverified
7 | Retro-Reader (ensemble) | EM | 90.58 | – | Unverified
8 | EntitySpanFocusV2 (ensemble) | EM | 90.52 | – | Unverified
9 | TransNets + SFVerifier + SFEnsembler (ensemble) | EM | 90.49 | – | Unverified
10 | EntitySpanFocus+AT (ensemble) | EM | 90.45 | – | Unverified