| Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Jun 20, 2024 | GSM8KLanguage Model Evaluation | CodeCode Available | 0 |
| ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling | Feb 21, 2024 | MMLURetrieval | CodeCode Available | 0 |
| Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective | Feb 20, 2025 | GSM8KMath | CodeCode Available | 0 |
| When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards | Feb 1, 2024 | Answer SelectionLanguage Modeling | CodeCode Available | 0 |
| MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning | Feb 27, 2024 | 8kLanguage Modeling | CodeCode Available | 0 |
| DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance | Jan 29, 2025 | DiversityMMLU | CodeCode Available | 0 |
| BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | May 25, 2025 | General KnowledgeMMLU | CodeCode Available | 0 |
| Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models | May 16, 2025 | DiversityMMLU | CodeCode Available | 0 |
| Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models | Dec 2, 2024 | MMLUMultiple-choice | CodeCode Available | 0 |
| Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models | Sep 27, 2023 | HumanEvalLanguage Modeling | CodeCode Available | 0 |