| Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts? | Mar 23, 2025 | GSM8K, Math | Code Available | 0 |
| Long Is More Important Than Difficult for Training Reasoning Models | Mar 23, 2025 | Math | Unverified | 0 |
| ChatBench: From Static Benchmarks to Human-AI Evaluation | Mar 22, 2025 | Math, MMLU | Code Available | 0 |
| Exploring the Hidden Reasoning Process of Large Language Models by Misleading Them | Mar 20, 2025 | Math, Memorization | Unverified | 0 |
| BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems | Mar 18, 2025 | CPU, Math | Unverified | 0 |
| Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs | Mar 18, 2025 | GSM8K, Math | Unverified | 0 |
| Pensez: Less Data, Better Reasoning -- Rethinking French LLM | Mar 17, 2025 | Large Language Model, Math | Unverified | 0 |
| Improving Complex Reasoning with Dynamic Prompt Corruption: A soft prompt Optimization Approach | Mar 17, 2025 | GSM8K, Math | Unverified | 0 |
| SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially? | Mar 16, 2025 | Board Games, Card Games | Unverified | 0 |
| The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory | Mar 13, 2025 | Math, Multiple-choice | Unverified | 0 |