| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Jul 4, 2024 | ARCGSM8K | CodeCode Available | 0 | 5 |
| NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models | Oct 11, 2024 | Multiple-choiceTruthfulQA | CodeCode Available | 0 | 5 |
| Truth Knows No Language: Evaluating Truthfulness Beyond English | Feb 13, 2025 | InformativenessMachine Translation | CodeCode Available | 0 | 5 |
| Truth Neurons | May 18, 2025 | TruthfulQA | CodeCode Available | 0 | 5 |
| DeLTa: A Decoding Strategy based on Logit Trajectory Prediction Improves Factuality and Reasoning Ability | Mar 4, 2025 | GSM8KLogical Reasoning | CodeCode Available | 0 | 5 |
| Unsupervised Elicitation of Language Models | Jun 11, 2025 | GSM8KTruthfulQA | CodeCode Available | 0 | 5 |
| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Jun 25, 2024 | ARCBenchmarking | CodeCode Available | 0 | 5 |
| Teaching language models to support answers with verified quotes | Mar 21, 2022 | Fact CheckingNatural Questions | —Unverified | 0 | 0 |
| Towards Multilingual LLM Evaluation for European Languages | Oct 11, 2024 | ARCGSM8K | —Unverified | 0 | 0 |
| TruthFlow: Truthful LLM Generation via Representation Flow Correction | Feb 6, 2025 | HallucinationTruthfulQA | —Unverified | 0 | 0 |