| TSQA: Tabular Scenario Based Question Answering | Jan 14, 2021 | Machine Reading Comprehension, Multiple-choice | Code Available | 1 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Feb 4, 2025 | Autonomous Driving, Multiple-choice | Code Available | 1 |
| Counterfactual Variable Control for Robust and Interpretable Question Answering | Oct 12, 2020 | Causal Inference, counterfactual | Code Available | 1 |
| Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models | Jul 15, 2024 | Backdoor Attack, Multiple-choice | Code Available | 1 |
| Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Mar 12, 2024 | Knowledge Graphs, Multiple-choice | Code Available | 1 |
| Assessing the Chemical Intelligence of Large Language Models | May 12, 2025 | Multiple-choice | Code Available | 1 |
| Unsupervised Commonsense Question Answering with Self-Talk | Apr 11, 2020 | Language Modeling, Language Modelling | Code Available | 1 |
| Conformal Prediction with Large Language Models for Multi-Choice Question Answering | May 28, 2023 | Conformal Prediction, Multiple-choice | Code Available | 1 |
| Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Apr 30, 2024 | Implicatures, Multiple-choice | Code Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |