| Towards Multilingual LLM Evaluation for European Languages | Oct 11, 2024 | ARCGSM8K | —Unverified | 0 | 0 |
| TruthFlow: Truthful LLM Generation via Representation Flow Correction | Feb 6, 2025 | HallucinationTruthfulQA | —Unverified | 0 | 0 |
| Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | Dec 1, 2024 | ARCMultiple-choice | —Unverified | 0 | 0 |
| Uncertainty-aware Language Modeling for Selective Question Answering | Nov 26, 2023 | Language ModelingLanguage Modelling | —Unverified | 0 | 0 |
| Unsupervised Elicitation of Language Models | Jun 11, 2025 | GSM8KTruthfulQA | —Unverified | 0 | 0 |
| When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR) | Apr 1, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 | 0 |
| Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models | Feb 20, 2025 | HellaSwagMemorization | —Unverified | 0 | 0 |
| Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts | Oct 11, 2024 | Holdout SetMisconceptions | —Unverified | 0 | 0 |
| Cost-Saving LLM Cascades with Early Abstention | Feb 13, 2025 | GSM8KMMLU | —Unverified | 0 | 0 |
| LLMAuditor: A Framework for Auditing Large Language Models Using Human-in-the-Loop | Feb 14, 2024 | HallucinationTruthfulQA | —Unverified | 0 | 0 |