| Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators | Apr 21, 2025 | Code GenerationInstruction Following | CodeCode Available | 0 |
| LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | Apr 21, 2025 | MathMMLU | —Unverified | 0 |
| OTC: Optimal Tool Calls via Reinforcement Learning | Apr 21, 2025 | Mathreinforcement-learning | —Unverified | 0 |
| Enhancing Math Learning in an LMS Using AI-Driven Question Recommendations | Apr 18, 2025 | ManagementMath | —Unverified | 0 |
| Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? | Apr 18, 2025 | MathVisual Reasoning | —Unverified | 0 |
| In between myth and reality: AI for math -- a case study in category theory | Apr 17, 2025 | Math | —Unverified | 0 |
| THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models | Apr 17, 2025 | BenchmarkingMath | —Unverified | 0 |
| MathPhys-Guided Coarse-to-Fine Anomaly Synthesis with SQE-Driven Bi-Level Optimization for Anomaly Detection | Apr 17, 2025 | Anomaly DetectionData Augmentation | —Unverified | 0 |
| Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation | Apr 16, 2025 | GSM8KMath | —Unverified | 0 |
| Rethinking the Generation of High-Quality CoT Data from the Perspective of LLM-Adaptive Question Difficulty Grading | Apr 16, 2025 | 2kCode Generation | —Unverified | 0 |