| Paper | Date | Tasks | Code | ★ |
|---|---|---|---|---|
| RLHF Workflow: From Reward Modeling to Online RLHF | May 13, 2024 | Chatbot, HumanEval | Code Available | 5 |
| TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space | Feb 27, 2024 | Contrastive Learning, Hallucination | Code Available | 2 |
| In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation | Mar 3, 2024 | Hallucination, TruthfulQA | Code Available | 2 |
| Inference-Time Intervention: Eliciting Truthful Answers from a Language Model | Jun 6, 2023 | Language Modeling | Code Available | 2 |
| Tuning Language Models by Proxy | Jan 16, 2024 | Domain Adaptation, Math | Code Available | 2 |
| DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models | Sep 7, 2023 | TruthfulQA | Code Available | 2 |
| Instruction Tuning With Loss Over Instructions | May 23, 2024 | HumanEval, MMLU | Code Available | 1 |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | Sep 13, 2023 | Ethics, TruthfulQA | Code Available | 1 |
| Non-Linear Inference Time Intervention: Improving LLM Truthfulness | Mar 27, 2024 | Large Language Model, Multiple-choice | Code Available | 1 |
| Integrative Decoding: Improve Factuality via Implicit Self-consistency | Oct 2, 2024 | TruthfulQA | Code Available | 1 |
| Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning | Dec 29, 2023 | TruthfulQA | Code Available | 1 |
| TruthfulQA: Measuring How Models Mimic Human Falsehoods | Sep 8, 2021 | Language Modeling | Code Available | 1 |
| Alleviating Hallucinations of Large Language Models through Induced Hallucinations | Dec 25, 2023 | Hallucination, Hallucination Evaluation | Code Available | 1 |
| RAIN: Your Language Models Can Align Themselves without Finetuning | Sep 13, 2023 | Adversarial Attack, TruthfulQA | Code Available | 1 |
| Tool-Augmented Reward Modeling | Oct 2, 2023 | TruthfulQA | Code Available | 1 |
| Machine Unlearning in Large Language Models | May 24, 2024 | Machine Unlearning, TruthfulQA | Code Available | 1 |
| Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Aug 18, 2023 | MMLU, Red Teaming | Code Available | 1 |
| DeLTa: A Decoding Strategy based on Logit Trajectory Prediction Improves Factuality and Reasoning Ability | Mar 4, 2025 | GSM8K, Logical Reasoning | Code Available | 0 |
| Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided Decoding | Jun 19, 2024 | Language Modeling | Code Available | 0 |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Jun 21, 2024 | Red Teaming, TruthfulQA | Code Available | 0 |
| SaGE: Evaluating Moral Consistency in Large Language Models | Feb 21, 2024 | Decision Making, HellaSwag | Code Available | 0 |
| A test suite of prompt injection attacks for LLM-based machine translation | Oct 7, 2024 | Machine Translation, Translation | Code Available | 0 |
| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Jul 4, 2024 | ARC, GSM8K | Code Available | 0 |
| NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models | Oct 11, 2024 | Multiple-choice, TruthfulQA | Code Available | 0 |
| CHAIR -- Classifier of Hallucination as Improver | Jan 5, 2025 | Hallucination, MMLU | Code Available | 0 |