| Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models | Jul 12, 2024 | GSM8KMath | —Unverified | 0 |
| Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | Jul 11, 2024 | GSM8KMath | —Unverified | 0 |
| Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On | Jul 11, 2024 | GSM8KMath | —Unverified | 0 |
| When is the consistent prediction likely to be a correct prediction? | Jul 8, 2024 | GSM8KPrediction | —Unverified | 0 |
| Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks | Jul 4, 2024 | GSM8KStrategyQA | —Unverified | 0 |
| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Jul 4, 2024 | ARCGSM8K | CodeCode Available | 0 |
| AgentInstruct: Toward Generative Teaching with Agentic Flows | Jul 3, 2024 | GSM8KMMLU | —Unverified | 0 |
| Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs | Jul 1, 2024 | DiversityGSM8K | —Unverified | 0 |
| Advancing Process Verification for Large Language Models via Tree-Based Preference Learning | Jun 29, 2024 | Binary ClassificationGSM8K | —Unverified | 0 |
| LiteSearch: Efficacious Tree Search for LLM | Jun 29, 2024 | GSM8KMathematical Reasoning | —Unverified | 0 |