| Expanding Search Space with Diverse Prompting Agents: An Efficient Sampling Approach for LLM Mathematical Reasoning | Oct 13, 2024 | Math, Mathematical Reasoning | Unverified | 0 |
| HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics | Oct 13, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| A Systematic Survey on Large Language Models for Algorithm Design | Oct 11, 2024 | Mathematical Reasoning, scientific discovery | Unverified | 0 |
| SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights | Oct 11, 2024 | GSM8K, Math | Code Available | 4 |
| TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees | Oct 10, 2024 | Mathematical Reasoning | Unverified | 0 |
| Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks | Oct 10, 2024 | 8k, Diversity | Unverified | 0 |
| Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | Oct 10, 2024 | GSM8K, Math | Code Available | 2 |
| MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code | Oct 10, 2024 | Math, Mathematical Reasoning | Code Available | 2 |
| Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | Oct 10, 2024 | Arithmetic Reasoning, Math | Code Available | 0 |
| VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers | Oct 10, 2024 | Mathematical Reasoning, Q-Learning | Unverified | 0 |
| Herald: A Natural Language Annotated Lean 4 Dataset | Oct 9, 2024 | Math, Mathematical Reasoning | Unverified | 0 |
| Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning | Oct 9, 2024 | Mathematical Reasoning | Unverified | 0 |
| PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness | Oct 9, 2024 | Mathematical Reasoning | Unverified | 0 |
| Subtle Errors Matter: Preference Learning via Error-injected Self-editing | Oct 9, 2024 | GSM8K, Math | Unverified | 0 |
| FG-PRM: Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning | Oct 8, 2024 | GSM8K, Hallucination | Unverified | 0 |
| LeanAgent: Lifelong Learning for Formal Theorem Proving | Oct 8, 2024 | Abstract Algebra, Automated Theorem Proving | Code Available | 2 |
| Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning | Oct 8, 2024 | Image Retrieval, Math | Unverified | 0 |
| Give me a hint: Can LLMs take a hint to solve math problems? | Oct 8, 2024 | Adversarial Robustness, Math | Code Available | 0 |
| MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs | Oct 7, 2024 | Information Retrieval, Mathematical Reasoning | Unverified | 0 |
| GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models | Oct 7, 2024 | GSM8K, Logical Reasoning | Code Available | 1 |
| Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark | Oct 6, 2024 | Mathematical Reasoning, Spatial Reasoning | Code Available | 0 |
| ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | Oct 6, 2024 | Benchmarking, Mathematical Reasoning | Unverified | 0 |
| Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement | Oct 6, 2024 | Mathematical Reasoning, Meta-Learning | Code Available | 2 |
| TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions | Oct 5, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| Table Question Answering for Low-resourced Indic Languages | Oct 4, 2024 | Cross-Lingual Transfer, Mathematical Reasoning | Code Available | 0 |