| Efficient Federated Search for Retrieval-Augmented Generation | Feb 26, 2025 | MMLU, RAG | Unverified | 0 |
| WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging | Feb 25, 2025 | MMLU, Multiple-choice | Code Available | 0 |
| Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks | Feb 24, 2025 | 2k, ARC | Unverified | 0 |
| Distributional Scaling Laws for Emergent Capabilities | Feb 24, 2025 | MMLU | Unverified | 0 |
| Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks | Feb 24, 2025 | Mixture-of-Experts, MMLU | Unverified | 0 |
| Detecting Benchmark Contamination Through Watermarking | Feb 24, 2025 | ARC, MMLU | Unverified | 0 |
| Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs | Feb 23, 2025 | Data Poisoning, Diagnostic | Unverified | 0 |
| Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models | Feb 20, 2025 | HellaSwag, Memorization | Unverified | 0 |
| Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests | Feb 20, 2025 | Logical Reasoning, MMLU | Unverified | 0 |
| Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective | Feb 20, 2025 | GSM8K, Math | Code Available | 0 |