Question Answering

Question answering can be segmented into domain-specific tasks like community question answering and knowledge-base question answering. Popular benchmark datasets for evaluating question answering systems include SQuAD, HotpotQA, bAbI, TriviaQA, WikiQA, and many others. Models for question answering are typically evaluated on metrics like exact match (EM) and F1. Some recent top-performing models are T5 and XLNet.
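
As a rough illustration of the two metrics named above, the sketch below computes exact match and token-level F1 for a single predicted answer against a single reference. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and the English articles "a", "an", "the"); the page itself does not say which normalization each benchmark uses, and the function names here are illustrative rather than taken from any official script.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors the normalization commonly used for SQuAD.
    """
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    prediction = "the Eiffel Tower in Paris"
    reference = "Eiffel Tower"
    print(exact_match(prediction, reference))
    print(round(f1_score(prediction, reference), 4))
```

In this example the prediction contains the reference but adds extra tokens, so EM is zero while F1 gives partial credit. Leaderboards usually average these per-question scores over the dataset and, when several reference answers exist, take the maximum over references; that aggregation is omitted here.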

Papers

Showing 651–700 of 10817 papers

Title	Date	Tasks	Status	Hype
DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering	Mar 5, 2025	3D Question Answering (3D-QA), Question Answering	Code Available	1
ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models	Feb 27, 2025	Question Answering, RAG	Code Available	1
FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users	Feb 26, 2025	In-Context Learning, Meta-Learning	Code Available	1
UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering	Feb 26, 2025	Question Answering	Code Available	1
MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks	Feb 25, 2025	Misinformation, Question Answering	Code Available	1
HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization	Feb 24, 2025	Diversity, Fact Verification	Code Available	1
KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse	Feb 21, 2025	Question Answering	Code Available	1
ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model	Feb 20, 2025	Mixture-of-Experts, Question Answering	Code Available	1
How to Get Your LLM to Generate Challenging Problems for Evaluation	Feb 20, 2025	Code Completion, Math	Code Available	1
Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information	Feb 20, 2025	Question Answering	Code Available	1
Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps	Feb 20, 2025	Question Answering	Code Available	1
PeerQA: A Scientific Question Answering Dataset from Peer Reviews	Feb 19, 2025	Answerability Prediction, Answer Generation	Code Available	1
CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space	Feb 18, 2025	Embodied Question Answering, Question Answering	Code Available	1
MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression	Feb 17, 2025	Diagnostic, Question Answering	Code Available	1
The Mirage of Model Editing: Revisiting Evaluation in the Wild	Feb 16, 2025	Model Editing, Question Answering	Code Available	1
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering	Feb 11, 2025	Question Answering, Video Question Answering	Code Available	1
LM2: Large Memory Models	Feb 9, 2025	Decoder, MMLU	Code Available	1
Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs	Feb 7, 2025	Federated Learning, Medical Question Answering	Code Available	1
PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?	Feb 6, 2025	Question Answering, Referring Expression	Code Available	1
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes	Feb 4, 2025	Autonomous Driving, Multiple-choice	Code Available	1
Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models	Feb 3, 2025	Adversarial Robustness, Image Captioning	Code Available	1
∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation	Jan 31, 2025	Question Answering, Video Question Answering	Code Available	1
KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search	Jan 31, 2025	Heuristic Search, Knowledge Base Question Answering	Code Available	1
o3-mini vs DeepSeek-R1: Which One is Safer?	Jan 30, 2025	Code Generation, Program Repair	Code Available	1
DRESSing Up LLM: Efficient Stylized Question-Answering via Style Subspace Editing	Jan 24, 2025	Language Modeling, Language Modelling	Code Available	1
InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models	Jan 19, 2025	Benchmarking, Question Answering	Code Available	1
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning	Jan 13, 2025	Causal Discovery, Causal Inference	Code Available	1
SensorQA: A Question Answering Benchmark for Daily-Life Monitoring	Jan 9, 2025	Question Answering	Code Available	1
ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark	Jan 9, 2025	Fairness, Hallucination	Code Available	1
VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models	Jan 9, 2025	Benchmarking, Mathematical Problem-Solving	Code Available	1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation	Jan 6, 2025	Language Model Evaluation, Language Modeling	Code Available	1
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?	Jan 5, 2025	Image Captioning, Image to text	Code Available	1
Predicting the Performance of Black-box LLMs through Self-Queries	Jan 2, 2025	Question Answering	Code Available	1
Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering	Jan 1, 2025	Large Language Model, Multimodal Large Language Model	Code Available	1
Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner	Dec 30, 2024	Question Answering, Table Recognition	Code Available	1
Long Context vs. RAG for LLMs: An Evaluation and Revisits	Dec 27, 2024	Question Answering, RAG	Code Available	1
Interacted Object Grounding in Spatio-Temporal Human-Object Interactions	Dec 27, 2024	Human-Object Interaction Detection, Object	Code Available	1
Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation	Dec 24, 2024	Graph Question Answering, Hallucination	Code Available	1
CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era	Dec 24, 2024	Knowledge Base Question Answering, Knowledge Graphs	Code Available	1
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating	Dec 24, 2024	Document Understanding, Question Answering	Code Available	1
Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models	Dec 24, 2024	Machine Translation, Molecular Property Prediction	Code Available	1
Resource-Aware Arabic LLM Creation: Model Adaptation, Integration, and Multi-Domain Testing	Dec 23, 2024	ArabicMMLU, Dialect Identification	Code Available	1
Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding	Dec 21, 2024	Attribute, Question Answering	Code Available	1
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization	Dec 19, 2024	Contrastive Learning, Decision Making	Code Available	1
Knowledge Editing with Dynamic Knowledge Graphs for Multi-Hop Question Answering	Dec 18, 2024	Graph Construction, Knowledge Editing	Code Available	1
MedCoT: Medical Chain of Thought via Hierarchical Expert	Dec 18, 2024	Diagnostic, Medical Visual Question Answering	Code Available	1
EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation	Dec 17, 2024	Question Answering, RAG	Code Available	1
MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants	Dec 17, 2024	Image Captioning, Question Answering	Code Available	1
SCITAT: A Question Answering Benchmark for Scientific Tables and Text Covering Diverse Reasoning Types	Dec 16, 2024	Question Answering	Code Available	1
UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models	Dec 16, 2024	Question Answering	Code Available	1

Datasets: SQuAD2.0, SQuAD1.1, HotpotQA, PIQA, BoolQ, COPA, TriviaQA, SQuAD1.1 dev, Natural Questions, OpenBookQA, TruthfulQA, MultiRC

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IE-Net (ensemble)	EM	90.94	—	Unverified
2	FPNet (ensemble)	EM	90.87	—	Unverified
3	IE-NetV2 (ensemble)	EM	90.86	—	Unverified
4	SA-Net on Albert (ensemble)	EM	90.72	—	Unverified
5	SA-Net-V2 (ensemble)	EM	90.68	—	Unverified
6	FPNet (ensemble)	EM	90.6	—	Unverified
7	Retro-Reader (ensemble)	EM	90.58	—	Unverified
8	EntitySpanFocusV2 (ensemble)	EM	90.52	—	Unverified
9	TransNets + SFVerifier + SFEnsembler (ensemble)	EM	90.49	—	Unverified
10	EntitySpanFocus+AT (ensemble)	EM	90.45	—	Unverified