Question Answering
Question answering can be segmented into domain-specific tasks such as community question answering and knowledge-base question answering. Popular benchmark datasets for evaluating question answering systems include SQuAD, HotPotQA, bAbI, TriviaQA, WikiQA, and many others. Models are typically evaluated with exact match (EM) and F1; a minimal sketch of both metrics is given below. Recent top-performing models include T5 and XLNet.
(Image credit: SQuAD)
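The following is a minimal sketch of span-extraction EM and token-level F1, following the normalization conventions used by the official SQuAD evaluation script (lowercasing, stripping punctuation and articles). Real evaluation scripts additionally take the maximum score over multiple gold answers per question, which is omitted here.

```python
# Minimal sketch of SQuAD-style EM and token-level F1.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))  # 1.0 after normalization
print(f1_score("in Paris, France", "Paris"))            # 0.5 (partial token overlap)
```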
Papers
Datasets: SQuAD2.0, SQuAD1.1, HotpotQA, PIQA, BoolQ, COPA, TriviaQA, SQuAD1.1 dev, Natural Questions, OpenBookQA, TruthfulQA, MultiRC
Benchmark Results
| # | Model | Metric | Claimed (%) | Verified (%) | Status |
|---|---|---|---|---|---|
| 1 | PaLM 540B (finetuned) | Accuracy | 100 | — | Unverified |
| 2 | Vega v2 6B (KD-based prompt transfer) | Accuracy | 99.4 | — | Unverified |
| 3 | ST-MoE-32B 269B (fine-tuned) | Accuracy | 99.2 | — | Unverified |
| 4 | UL2 20B (fine-tuned) | Accuracy | 99 | — | Unverified |
| 5 | DeBERTa-Ensemble | Accuracy | 98.4 | — | Unverified |
| 6 | Turing NLR v5 XXL 5.4B (fine-tuned) | Accuracy | 98.2 | — | Unverified |
| 7 | DeBERTa-1.5B | Accuracy | 96.8 | — | Unverified |
| 8 | PaLM 2-L (1-shot) | Accuracy | 96 | — | Unverified |
| 9 | T5-XXL 11B (fine-tuned) | Accuracy | 94.8 | — | Unverified |
| 10 | FLAN 137B (prompt-tuned) | Accuracy | 94 | — | Unverified |
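A "Verified" entry for the table above would come from re-running a model on the benchmark's evaluation split and comparing the measured accuracy against the claimed number. Below is a hedged sketch of one way to do that for BoolQ (one of the datasets listed above) using the Hugging Face `datasets` and `transformers` libraries; the model name is a placeholder, and the `LABEL_1` mapping is an assumption that depends on the checkpoint's label configuration.

```python
# Hedged sketch: re-measuring a claimed accuracy on BoolQ (yes/no QA).
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("super_glue", "boolq", split="validation")
clf = pipeline("text-classification", model="your-boolq-model")  # placeholder checkpoint

correct = 0
for example in dataset:
    pred = clf(example["question"] + " " + example["passage"], truncation=True)[0]
    # Assumes the checkpoint maps LABEL_1 -> "yes"; check config.id2label in practice.
    predicted_label = 1 if pred["label"] == "LABEL_1" else 0
    correct += int(predicted_label == example["label"])

accuracy = 100.0 * correct / len(dataset)
print(f"Verified accuracy: {accuracy:.1f}")  # compare against the claimed value
```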