| Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge | Mar 14, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| How Additional Knowledge can Improve Natural Language Commonsense Question Answering? | Sep 19, 2019 | Articles, Language Modeling | —Unverified | 0 | 0 |
| Exposing the Limits of Video-Text Models through Contrast Sets | Jan 16, 2022 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History | Jan 15, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees | Nov 4, 2024 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Towards Multistage Design of Modular Systems | Jun 19, 2013 | Multiple-choice | —Unverified | 0 | 0 |
| FAMULUS: Interactive Annotation and Feedback Generation for Teaching Diagnostic Reasoning | Aug 29, 2019 | Diagnostic, Multiple-choice | —Unverified | 0 | 0 |
| FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | Apr 20, 2025 | Descriptive, Ethics | —Unverified | 0 | 0 |
| Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction | Jan 28, 2025 | Logical Reasoning, Multiple-choice | —Unverified | 0 | 0 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | Mar 19, 2025 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |