| TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving | Jun 12, 2025 | Logical Reasoning, Mathematical Problem-Solving | —Unverified | 0 |
| Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic | Jun 9, 2025 | Mathematical Reasoning | —Unverified | 0 |
| Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset | Jun 25, 2025 | Mathematical Reasoning | —Unverified | 0 |
| Text Generation Beyond Discrete Token Sampling | May 20, 2025 | Code Generation, Mathematical Reasoning | —Unverified | 0 |
| The Axiom-Based Atlas: A Structural Mapping of Theorems via Foundational Proof Vectors | Mar 31, 2025 | Mathematical Reasoning | —Unverified | 0 |
| The Karp Dataset | Jan 24, 2025 | Benchmarking, Mathematical Reasoning | —Unverified | 0 |
| The Lessons of Developing Process Reward Models in Mathematical Reasoning | Jan 13, 2025 | Mathematical Reasoning | —Unverified | 0 |
| Theorem Prover as a Judge for Synthetic Data Generation | Feb 18, 2025 | Mathematical Proofs, Mathematical Reasoning | —Unverified | 0 |
| Theoretical Analysis of an XGBoost Framework for Product Cannibalization | Dec 2, 2021 | Mathematical Reasoning | —Unverified | 0 |
| The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic | Jun 28, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |