| Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | Dec 1, 2024 | ARC, Multiple-choice | —Unverified | 0 | 0 |
| Uncertainty-aware Language Modeling for Selective Question Answering | Nov 26, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR) | Apr 1, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models | Feb 20, 2025 | HellaSwag, Memorization | —Unverified | 0 | 0 |
| Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts | Oct 11, 2024 | Holdout Set, Misconceptions | —Unverified | 0 | 0 |
| Cost-Saving LLM Cascades with Early Abstention | Feb 13, 2025 | GSM8K, MMLU | —Unverified | 0 | 0 |
| LLMAuditor: A Framework for Auditing Large Language Models Using Human-in-the-Loop | Feb 14, 2024 | Hallucination, TruthfulQA | —Unverified | 0 | 0 |
| DYNAMAX: Dynamic computing for Transformers and Mamba based architectures | Apr 29, 2025 | Mamba, TriviaQA | —Unverified | 0 | 0 |
| Efficiently Deploying LLMs with Controlled Risk | Oct 3, 2024 | MMLU, TruthfulQA | —Unverified | 0 | 0 |
| Efficient MAP Estimation of LLM Judgment Performance with Prior Transfer | Apr 17, 2025 | Conformal Prediction, TruthfulQA | —Unverified | 0 | 0 |