| KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge | Feb 21, 2024 | 4k, Multiple-choice | —Unverified | 0 |
| KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | May 14, 2025 | Benchmarking, MMLU | —Unverified | 0 |
| LAB-Bench: Measuring Capabilities of Language Models for Biology Research | Jul 14, 2024 | Language Modelling, Multiple-choice | —Unverified | 0 |
| LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | Oct 18, 2024 | Benchmarking, Fairness | —Unverified | 0 |
| Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model | Oct 1, 2024 | All, Language Modeling | —Unverified | 0 |
| Language models are susceptible to incorrect patient self-diagnosis in medical applications | Sep 17, 2023 | Diagnostic, Multiple-choice | —Unverified | 0 |
| Uncovering Cultural Representation Disparities in Vision-Language Models | May 20, 2025 | Multiple-choice | —Unverified | 0 |
| Language Models (Mostly) Know What They Know | Jul 11, 2022 | Multiple-choice | —Unverified | 0 |
| Uncovering Temporal Context for Video Question and Answering | Nov 15, 2015 | Decoder, Multiple-choice | —Unverified | 0 |
| LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights | Oct 17, 2024 | Legal Reasoning, Multiple-choice | —Unverified | 0 |