Question Answering
Question answering can be segmented into domain-specific tasks such as community question answering and knowledge-base question answering. Popular benchmark datasets for evaluating question answering systems include SQuAD, HotpotQA, bAbI, TriviaQA, WikiQA, and many others. Models for question answering are typically evaluated on metrics such as exact match (EM) and F1. Some recent top-performing models are T5 and XLNet.
(Image credit: SQuAD)
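The EM and F1 metrics mentioned above can be made concrete with a short sketch. The snippet below follows the SQuAD evaluation convention (lowercasing, stripping punctuation and articles before comparison); function names are illustrative, not from any particular library.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace
    (the SQuAD answer-normalization convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall
    over the bag of normalized answer tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `exact_match("The Eiffel Tower", "eiffel tower")` returns 1.0 after normalization, while `token_f1("the tower in Paris", "Eiffel Tower")` gives partial credit for the shared token. Corpus-level scores are obtained by averaging these per-example values, taking the maximum over multiple reference answers where the dataset provides them.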
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Mistral-Nemo 12B (HPT) | Accuracy | 99.87 | — | Unverified |
| 2 | ST-MoE-32B 269B (fine-tuned) | Accuracy | 92.4 | — | Unverified |
| 3 | PaLM 540B (fine-tuned) | Accuracy | 92.2 | — | Unverified |
| 4 | Turing NLR v5 XXL 5.4B (fine-tuned) | Accuracy | 92.0 | — | Unverified |
| 5 | T5-XXL 11B (fine-tuned) | Accuracy | 91.2 | — | Unverified |
| 6 | PaLM 2-L (1-shot) | Accuracy | 90.9 | — | Unverified |
| 7 | UL2 20B (fine-tuned) | Accuracy | 90.8 | — | Unverified |
| 8 | Vega v2 6B (fine-tuned) | Accuracy | 90.5 | — | Unverified |
| 9 | DeBERTa-1.5B | Accuracy | 90.4 | — | Unverified |
| 10 | PaLM 2-M (1-shot) | Accuracy | 88.6 | — | Unverified |