| Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation | Apr 4, 2025 | MathMathematical Reasoning | —Unverified | 0 |
| Do LLM Evaluators Prefer Themselves for a Reason? | Apr 4, 2025 | BenchmarkingCode Generation | CodeCode Available | 0 |
| Sample, Don't Search: Rethinking Test-Time Alignment for Language Models | Apr 4, 2025 | GSM8KMathematical Reasoning | —Unverified | 0 |
| LexPam: Legal Procedure Awareness-Guided Mathematical Reasoning | Apr 3, 2025 | Mathematical ReasoningQuestion Answering | —Unverified | 0 |
| LLM Library Learning Fails: A LEGO-Prover Case Study | Apr 3, 2025 | Mathematical ReasoningMisconceptions | —Unverified | 0 |
| LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems | Apr 3, 2025 | Mathematical Reasoning | —Unverified | 0 |
| How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study | Apr 1, 2025 | Code GenerationMath | —Unverified | 0 |
| Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics | Apr 1, 2025 | MathMathematical Problem-Solving | —Unverified | 0 |
| VerifiAgent: a Unified Verification Agent in Language Model Reasoning | Apr 1, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 0 |
| GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning | Apr 1, 2025 | MathMathematical Reasoning | —Unverified | 0 |