Reading Comprehension
Most current question answering datasets frame the task as reading comprehension: the question is about a paragraph or document, and the answer is often a span in that document.
Variants of the task include multi-modal machine reading comprehension and textual machine reading comprehension, among others. In the literature, machine reading comprehension is commonly divided into four categories: cloze style, multiple choice, span prediction, and free-form answer.
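Span prediction, the most common setting, can be illustrated in a few lines. The sketch below uses the Hugging Face `transformers` question-answering pipeline; the checkpoint name is one illustrative choice among many SQuAD-style extractive QA models, not a recommendation from this page.

```python
# Minimal span-prediction sketch using the Hugging Face `transformers`
# question-answering pipeline. The checkpoint below is an illustrative
# choice; any extractive QA model fine-tuned on SQuAD-style data works.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = (
    "Machine reading comprehension systems answer questions about a "
    "passage, often by selecting a span from the passage itself."
)
result = qa(question="How do many systems produce an answer?", context=context)

# The pipeline returns the predicted answer text, its character offsets
# in the context, and a confidence score.
print(result["answer"], result["start"], result["end"], result["score"])
```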
Benchmark datasets used for testing a model's reading comprehension abilities include MovieQA, ReCoRD, and RACE, among others.
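For the multiple-choice setting, RACE is a representative benchmark. As a sketch, assuming the copy hosted on the Hugging Face hub under the dataset name `race`, a single example can be inspected as follows:

```python
# Sketch of loading one multiple-choice RACE example with the `datasets`
# library (dataset and config names assume the Hugging Face hub copy).
from datasets import load_dataset

race = load_dataset("race", "middle", split="validation")

ex = race[0]
print(ex["article"][:200])  # the passage (truncated here for display)
print(ex["question"])       # the question stem
print(ex["options"])        # the four candidate answers
print(ex["answer"])         # the gold label, a letter "A" through "D"
```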
The Machine Reading group at UCL also provides an overview of reading comprehension tasks.
(Figure omitted. Source: A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics and Benchmark Datasets.)
Benchmark Results
Each table below reports a separate benchmark leaderboard.
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Rational Reasoner / IDOL | Test accuracy | 80.6 | — | Unverified |
| 2 | AMR-LE-Ensemble | Test accuracy | 80 | — | Unverified |
| 3 | MERIt (MERIt-deberta-v2-xxlarge) | Test accuracy | 79.3 | — | Unverified |
| 4 | MERIt-deberta-v2-xxlarge (run: `deberta.v2.xxlarge.path.override_True.norm_1.1.0.w2.A100.cp200.s42`) | Test accuracy | 79.3 | — | Unverified |
| 5 | Knowledge model | Test accuracy | 79.2 | — | Unverified |
| 6 | DeBERTa-v2-xxlarge-AMR-LE-Contraposition | Test accuracy | 77.2 | — | Unverified |
| 7 | LReasoner ensemble | Test accuracy | 76.1 | — | Unverified |
| 8 | ELECTRA and ALBERT | Test accuracy | 71 | — | Unverified |
| 9 | WWZ | Test accuracy | 69.7 | — | Unverified |
| 10 | xlnet-large-uncased [extended data] | Test accuracy | 69.3 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | ALBERT (Ensemble) | Accuracy | 91.4 | — | Unverified |
| 2 | Megatron-BERT (ensemble) | Accuracy | 90.9 | — | Unverified |
| 3 | ALBERT-xxlarge + DUMA (ensemble) | Accuracy | 89.8 | — | Unverified |
| 4 | Megatron-BERT | Accuracy | 89.5 | — | Unverified |
| 5 | XLNet | Accuracy (Middle) | 88.6 | — | Unverified |
| 6 | DeBERTa-large | Accuracy | 86.8 | — | Unverified |
| 7 | B10-10-10 | Accuracy | 85.7 | — | Unverified |
| 8 | RoBERTa | Accuracy | 83.2 | — | Unverified |
| 9 | Orca 2-13B | Accuracy | 82.87 | — | Unverified |
| 10 | Orca 2-7B | Accuracy | 80.79 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Golden Transformer | Average F1 | 0.94 | — | Unverified |
| 2 | MT5 Large | Average F1 | 0.84 | — | Unverified |
| 3 | ruRoberta-large finetune | Average F1 | 0.83 | — | Unverified |
| 4 | ruT5-large-finetune | Average F1 | 0.82 | — | Unverified |
| 5 | Human Benchmark | Average F1 | 0.81 | — | Unverified |
| 6 | ruT5-base-finetune | Average F1 | 0.77 | — | Unverified |
| 7 | ruBert-large finetune | Average F1 | 0.76 | — | Unverified |
| 8 | ruBert-base finetune | Average F1 | 0.74 | — | Unverified |
| 9 | RuGPT3XL few-shot | Average F1 | 0.74 | — | Unverified |
| 10 | RuGPT3Large | Average F1 | 0.73 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | RoBERTa-Large | Overall F1 | 64.4 | — | Unverified |
| 2 | BERT-Large | Overall F1 | 62.7 | — | Unverified |
| 3 | BiDAF | Overall F1 | 28.5 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BERT | MSE | 0.05 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BERT pretrained on MIMIC-III | Answer F1 | 63.55 | — | Unverified |
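The F1 numbers above (e.g., the "Answer F1" and "Average F1" columns) are typically token-overlap scores in the style of the SQuAD evaluation. A simplified sketch follows; official evaluation scripts usually also normalize punctuation and articles before comparing tokens.

```python
# Simplified token-overlap F1 for extractive QA answers, in the style of
# the SQuAD evaluation. Official scripts also strip punctuation/articles.
from collections import Counter

def answer_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("the span in the document", "a span in the document"))  # 0.8
```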