| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Jun 25, 2024 | ARCBenchmarking | CodeCode Available | 0 |
| Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving | Dec 20, 2024 | Computational EfficiencyGSM8K | CodeCode Available | 0 |
| Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective | Feb 20, 2025 | GSM8KMath | CodeCode Available | 0 |
| In-Context Principle Learning from Mistakes | Feb 8, 2024 | GSM8KIn-Context Learning | CodeCode Available | 0 |
| AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations | Nov 22, 2023 | Common Sense ReasoningGSM8K | CodeCode Available | 0 |
| TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation | Feb 19, 2025 | Dataset GenerationGSM8K | CodeCode Available | 0 |
| TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students | May 2, 2025 | GSM8KIn-Context Learning | CodeCode Available | 0 |
| How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective | Oct 14, 2024 | Density Ratio EstimationGSM8K | CodeCode Available | 0 |
| GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment | May 30, 2024 | GSM8KKnowledge Distillation | CodeCode Available | 0 |
| Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls | Feb 16, 2025 | Computational EfficiencyGSM8K | CodeCode Available | 0 |