SOTAVerified

Question Answering

Question answering can be segmented into domain-specific tasks such as community question answering and knowledge-base question answering. Popular benchmark datasets for evaluating question answering systems include SQuAD, HotpotQA, bAbI, TriviaQA, WikiQA, and many others. Models for question answering are typically evaluated on metrics such as exact match (EM) and F1. Some recent top-performing models are T5 and XLNet.
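The EM and F1 metrics mentioned above follow the SQuAD evaluation convention: answers are normalized (lowercased, punctuation and articles stripped) before comparison. A minimal sketch of that convention — the function names here are illustrative, not from any particular library:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For datasets with multiple reference answers, the official scripts take the maximum score over all references; that step is omitted here for brevity.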

(Image credit: SQuAD)

Papers

Showing 9751–9775 of 10817 papers

| Title | Status | Hype |
|---|---|---|
| What or Who is Multilingual Watson? | | 0 |
| What Question Answering can Learn from Trivia Nerds | | 0 |
| What Should I Do Now? Marrying Reinforcement Learning and Symbolic Planning | | 0 |
| What's in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams | | 0 |
| What's in your Head? Emergent Behaviour in Multi-Task Transformer Models | | 0 |
| What Would a Teacher Do? Predicting Future Talk Moves | | 0 |
| What Would it Take to get Biomedical QA Systems into Practice? | | 0 |
| When ACE met KBP: End-to-End Evaluation of Knowledge Base Population with Component-level Annotation | | 0 |
| When are Lemons Purple? The Concept Association Bias of Vision-Language Models | | 0 |
| When Crowd Meets Persona: Creating a Large-Scale Open-Domain Persona Dialogue Corpus | | 0 |
| When Giant Language Brains Just Aren't Enough! Domain Pizzazz with Knowledge Sparkle Dust | | 0 |
| When is dataset cartography ineffective? Using training dynamics does not improve robustness against Adversarial SQuAD | | 0 |
| When to Read Documents or QA History: On Unified and Selective Open-domain QA | | 0 |
| When to Speak, When to Abstain: Contrastive Decoding with Abstention | | 0 |
| When Two LLMs Debate, Both Think They'll Win | | 0 |
| Where is Linked Data in Question Answering over Linked Data? | | 0 |
| Where is this coming from? Making groundedness count in the evaluation of Document VQA models | | 0 |
| Where To Look: Focus Regions for Visual Question Answering | | 0 |
| Where Was Alexander the Great in 325 BC? Toward Understanding History Text with a World Model | | 0 |
| Where Was COVID-19 First Discovered? Designing a Question-Answering System for Pandemic Situations | | 0 |
| Which Client is Reliable?: A Reliable and Personalized Prompt-based Federated Learning for Medical Image Question Answering | | 0 |
| Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering | | 0 |
| Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above | | 0 |
| Which Step Do I Take First? Troubleshooting with Bayesian Models | | 0 |
Page 391 of 433

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | IE-Net (ensemble) | EM | 90.94 | | Unverified |
| 2 | FPNet (ensemble) | EM | 90.87 | | Unverified |
| 3 | IE-NetV2 (ensemble) | EM | 90.86 | | Unverified |
| 4 | SA-Net on Albert (ensemble) | EM | 90.72 | | Unverified |
| 5 | SA-Net-V2 (ensemble) | EM | 90.68 | | Unverified |
| 6 | FPNet (ensemble) | EM | 90.6 | | Unverified |
| 7 | Retro-Reader (ensemble) | EM | 90.58 | | Unverified |
| 8 | EntitySpanFocusV2 (ensemble) | EM | 90.52 | | Unverified |
| 9 | TransNets + SFVerifier + SFEnsembler (ensemble) | EM | 90.49 | | Unverified |
| 10 | EntitySpanFocus+AT (ensemble) | EM | 90.45 | | Unverified |