| Exposing the Limits of Video-Text Models through Contrast Sets | Jan 16, 2022 | Language Modeling, Language Modelling | —Unverified | 0 |
| Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History | Jan 15, 2025 | Multiple-choice, Question Answering | —Unverified | 0 |
| FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees | Nov 4, 2024 | Multiple-choice, Question Answering | —Unverified | 0 |
| Towards Multistage Design of Modular Systems | Jun 19, 2013 | Multiple-choice | —Unverified | 0 |
| FAMULUS: Interactive Annotation and Feedback Generation for Teaching Diagnostic Reasoning | Aug 29, 2019 | Diagnostic, Multiple-choice | —Unverified | 0 |
| FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | Apr 20, 2025 | Descriptive, Ethics | —Unverified | 0 |
| Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction | Jan 28, 2025 | Logical Reasoning, Multiple-choice | —Unverified | 0 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | Mar 19, 2025 | Benchmarking, Multiple-choice | —Unverified | 0 |
| Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | Mar 15, 2024 | Few-Shot Image Classification, image-classification | —Unverified | 0 |
| Field-testing items using artificial intelligence: Natural language processing with transformers | Oct 18, 2023 | Multiple-choice | —Unverified | 0 |