Question Answering

Question answering can be segmented into domain-specific tasks such as community question answering and knowledge-base question answering. Popular benchmark datasets for evaluating question answering systems include SQuAD, HotpotQA, bAbI, TriviaQA, and WikiQA, among many others. Models for question answering are typically evaluated on metrics such as exact match (EM) and F1. Some recent top-performing models are T5 and XLNet.
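As a rough illustration of the EM and F1 metrics mentioned above, the following Python sketch scores a predicted answer string against a single reference answer. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles) and token-level F1; the function names `normalize`, `exact_match`, and `f1_score` are hypothetical, and individual benchmarks may define these metrics differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors the commonly used SQuAD-style normalization.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower"))
    print(f1_score("in Paris, France", "Paris"))
```

EM rewards only a full match after normalization, while F1 gives partial credit for overlapping tokens. When a question has several reference answers, a common convention is to take the maximum score over the references and then average over all questions.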


Papers


Showing 1–10 of 10817 papers

Title	Date	Tasks	Status	Hype
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning	Jan 22, 2025	Mathematical Reasoning, Multi-task Language Understanding	Code Available	15
From Local to Global: A Graph RAG Approach to Query-Focused Summarization	Apr 24, 2024	Query-focused Summarization, Question Answering	Code Available	14
WebWalker: Benchmarking LLMs in Web Traversal	Jan 13, 2025	Benchmarking, Open-Domain Question Answering	Code Available	11
SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning	Aug 10, 2024	Hallucination, Optical Character Recognition	Code Available	11
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding	Dec 13, 2024	Chart Understanding, Mixture-of-Experts	Code Available	9
KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation	Sep 10, 2024	Knowledge Graphs, Question Answering	Code Available	9
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack	Jun 14, 2024	Question Answering, Retrieval-augmented Generation	Code Available	9
Visually Descriptive Language Model for Vector Graphics Reasoning	Apr 9, 2024	Descriptive, Language Modeling	Code Available	9
Llama 2: Open Foundation and Fine-Tuned Chat Models	Jul 18, 2023	Arithmetic Reasoning	Code Available	8
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning	Jun 11, 2025	Action Anticipation, Large Language Model	Code Available	7



Datasets: SQuAD2.0, SQuAD1.1, HotpotQA, PIQA, BoolQ, COPA, TriviaQA, SQuAD1.1 dev, Natural Questions, OpenBookQA, TruthfulQA, MultiRC

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IE-Net (ensemble)	EM	90.94	—	Unverified
2	FPNet (ensemble)	EM	90.87	—	Unverified
3	IE-NetV2 (ensemble)	EM	90.86	—	Unverified
4	SA-Net on Albert (ensemble)	EM	90.72	—	Unverified
5	SA-Net-V2 (ensemble)	EM	90.68	—	Unverified
6	FPNet (ensemble)	EM	90.6	—	Unverified
7	Retro-Reader (ensemble)	EM	90.58	—	Unverified
8	EntitySpanFocusV2 (ensemble)	EM	90.52	—	Unverified
9	TransNets + SFVerifier + SFEnsembler (ensemble)	EM	90.49	—	Unverified
10	EntitySpanFocus+AT (ensemble)	EM	90.45	—	Unverified