SOTAVerified

Reading Comprehension

Most current question answering datasets frame the task as reading comprehension: the question is about a paragraph or document, and the answer is often a span within that document.

Specific variants of the task include multi-modal machine reading comprehension and textual machine reading comprehension, among others. In the literature, machine reading comprehension is divided into four categories: cloze style, multiple choice, span prediction, and free-form answer; a span-prediction sketch follows below.
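To make the span-prediction category concrete, here is a minimal sketch using the Hugging Face `transformers` question-answering pipeline. The model name is illustrative (a SQuAD-finetuned DistilBERT checkpoint), not one of the systems listed below:

```python
# Minimal sketch of span-prediction reading comprehension.
# Assumption: the illustrative checkpoint below; any extractive QA model works.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = (
    "Machine reading comprehension asks a model to answer a question "
    "about a given passage, often by extracting a span of the passage."
)

# The pipeline returns the predicted answer span with character offsets
# into the context and a confidence score.
result = qa(question="What does the model extract?", context=context)
print(result["answer"], result["score"], result["start"], result["end"])
```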

Benchmark datasets used for testing a model's reading comprehension abilities include MovieQA, ReCoRD, and RACE, among others.
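For example, RACE (multiple-choice comprehension from English exams) can be loaded through the `datasets` library; the config names below ("high", "middle", "all") follow the dataset card on the Hugging Face hub:

```python
# Minimal sketch of inspecting one RACE example.
from datasets import load_dataset

race = load_dataset("race", "high", split="validation")
example = race[0]
print(example["article"][:200])  # the passage
print(example["question"])       # the question stem
print(example["options"])        # four candidate answers
print(example["answer"])         # gold label: "A"/"B"/"C"/"D"
```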

The Machine Reading group at UCL also provides an overview of reading comprehension tasks.

Figure source: A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics and Benchmark Datasets

Papers

Showing 301–350 of 1760 papers

Title | Status | Hype
Modeling Hierarchical Reasoning Chains by Linking Discourse Units and Key Phrases for Reading Comprehension | Code | 1
Improving Reading Comprehension Question Generation with Data Augmentation and Overgenerate-and-rank | Code | 0
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding | Code | 1
Bridging the Gap between Decision and Logits in Decision-based Knowledge Distillation for Pre-trained Language Models | Code | 0
Improving Opinion-based Question Answering Systems Through Label Error Detection and Overwrite | — | 0
PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts | — | 0
Knowing-how & Knowing-that: A New Task for Machine Comprehension of User Manuals | Code | 0
LogiQA 2.0—An Improved Dataset for Logical Reasoning in Natural Language Understanding | Code | 0
How Many Answers Should I Give? An Empirical Study of Multi-Answer Reading Comprehension | Code | 0
Minding Language Models' (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker | — | 0
Towards Flow Graph Prediction of Open-Domain Procedural Texts | — | 0
Large Language Models Are Not Strong Abstract Reasoners | Code | 1
A Practical Toolkit for Multilingual Question and Answer Generation | — | 0
GenQ: Automated Question Generation to Support Caregivers While Reading Stories with Children | — | 0
Machine Reading Comprehension using Case-based Reasoning | — | 0
A Causal View of Entity Bias in (Large) Language Models | Code | 0
Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions | Code | 0
ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings | — | 0
mPMR: A Multilingual Pre-trained Machine Reader at Scale | Code | 0
WYWEB: A NLP Evaluation Benchmark For Classical Chinese | Code | 1
NarrativeXL: A Large-scale Dataset For Long-Term Memory Models | Code | 1
DUBLIN -- Document Understanding By Language-Image Network | — | 0
Cross-functional Analysis of Generalisation in Behavioural Learning | Code | 0
Leveraging Human Feedback to Scale Educational Datasets: Combining Crowdworkers and Comparative Judgement | — | 0
Abstract Meaning Representation-Based Logic-Driven Data Augmentation for Logical Reasoning | Code | 1
VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models | Code | 1
S^3HQA: A Three-Stage Approach for Multi-hop Text-Table Hybrid Question Answering | Code | 1
A quantitative study of NLP approaches to question difficulty estimation | Code | 0
EMBRACE: Evaluation and Modifications for Boosting RACE | Code | 0
What's the Meaning of Superhuman Performance in Today's NLU? | — | 0
Coreference-aware Double-channel Attention Network for Multi-party Dialogue Reading Comprehension | Code | 0
SkillQG: Learning to Generate Question for Reading Comprehension Assessment | — | 0
NER-to-MRC: Named-Entity Recognition Completely Solving as Machine Reading Comprehension | — | 0
Adaptive loose optimization for robust question answering | Code | 0
Multi-View Graph Representation Learning for Answering Hybrid Numerical Reasoning Question | Code | 0
A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension | Code | 1
NorQuAD: Norwegian Question Answering Dataset | Code | 1
Information Extraction from Documents: Question Answering vs Token Classification in real-world setups | — | 0
DISTO: Evaluating Textual Distractors for Multi-Choice Questions using Negative Sampling based Approach | — | 0
Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering | — | 0
Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4 | Code | 1
Evaluating the Robustness of Machine Reading Comprehension Models to Low Resource Entity Renaming | — | 0
MiniRBT: A Two-stage Distilled Small Chinese Pre-trained Model | Code | 2
Deep Manifold Learning for Reading Comprehension and Logical Reasoning Tasks with Polytuplet Loss | Code | 0
A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets | — | 0
A Multiple Choices Reading Comprehension Corpus for Vietnamese Language Education | Code | 0
BloombergGPT: A Large Language Model for Finance | — | 0
Automatic Generation of Multiple-Choice Questions | — | 0
Context-faithful Prompting for Large Language Models | Code | 1
Revealing Weaknesses of Vietnamese Language Models Through Unanswerable Questions in Machine Reading Comprehension | — | 0
Page 7 of 36

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Rational Reasoner / IDOL | Test | 80.6 | — | Unverified
2 | AMR-LE-Ensemble | Test | 80 | — | Unverified
3 | MERIt (MERIt-deberta-v2-xxlarge) | Test | 79.3 | — | Unverified
4 | MERIt-deberta-v2-xxlarge deberta.v2.xxlarge.path.override_True.norm_1.1.0.w2.A100.cp200.s42 | Test | 79.3 | — | Unverified
5 | Knowledge model | Test | 79.2 | — | Unverified
6 | DeBERTa-v2-xxlarge-AMR-LE-Contraposition | Test | 77.2 | — | Unverified
7 | LReasoner ensemble | Test | 76.1 | — | Unverified
8 | ELECTRA and ALBERT | Test | 71 | — | Unverified
9 | WWZ | Test | 69.7 | — | Unverified
10 | xlnet-large-uncased [extended data] | Test | 69.3 | — | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | ALBERT (Ensemble) | Accuracy | 91.4 | — | Unverified
2 | Megatron-BERT (ensemble) | Accuracy | 90.9 | — | Unverified
3 | ALBERT-xxlarge + DUMA (ensemble) | Accuracy | 89.8 | — | Unverified
4 | Megatron-BERT | Accuracy | 89.5 | — | Unverified
5 | XLNet | Accuracy (Middle) | 88.6 | — | Unverified
6 | DeBERTa-large | Accuracy | 86.8 | — | Unverified
7 | B10-10-10 | Accuracy | 85.7 | — | Unverified
8 | RoBERTa | Accuracy | 83.2 | — | Unverified
9 | Orca 2-13B | Accuracy | 82.87 | — | Unverified
10 | Orca 2-7B | Accuracy | 80.79 | — | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Golden Transformer | Average F1 | 0.94 | — | Unverified
2 | MT5 Large | Average F1 | 0.84 | — | Unverified
3 | ruRoberta-large finetune | Average F1 | 0.83 | — | Unverified
4 | ruT5-large-finetune | Average F1 | 0.82 | — | Unverified
5 | Human Benchmark | Average F1 | 0.81 | — | Unverified
6 | ruT5-base-finetune | Average F1 | 0.77 | — | Unverified
7 | ruBert-large finetune | Average F1 | 0.76 | — | Unverified
8 | ruBert-base finetune | Average F1 | 0.74 | — | Unverified
9 | RuGPT3XL few-shot | Average F1 | 0.74 | — | Unverified
10 | RuGPT3Large | Average F1 | 0.73 | — | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | RoBERTa-Large | Overall: F1 | 64.4 | — | Unverified
2 | BERT-Large | Overall: F1 | 62.7 | — | Unverified
3 | BiDAF | Overall: F1 | 28.5 | — | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | BERT | MSE | 0.05 | — | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | BERT pretrained on MIMIC-III | Answer F1 | 63.55 | — | Unverified
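Several of the leaderboards above report F1 rather than accuracy. For reference, here is a minimal sketch of the token-overlap F1 used by SQuAD-style extractive QA evaluations; whether every benchmark above uses exactly this definition is an assumption:

```python
# Token-overlap F1 between a predicted answer span and the gold answer.
# Assumption: SQuAD-style definition; real evaluators also strip
# punctuation and articles before tokenizing.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Count tokens shared between prediction and gold (with multiplicity).
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("a span of the passage", "span of the passage"))  # ≈ 0.889
```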