| When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems | Oct 16, 2024 | HallucinationMath | —Unverified | 0 |
| Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning | Oct 16, 2024 | AllGSM8K | CodeCode Available | 0 |
| LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks | Oct 16, 2024 | Mathparameter-efficient fine-tuning | CodeCode Available | 1 |
| JudgeBench: A Benchmark for Evaluating LLM-based Judges | Oct 16, 2024 | Math | CodeCode Available | 2 |
| MIND: Math Informed syNthetic Dialogues for Pretraining LLMs | Oct 15, 2024 | GSM8KMath | —Unverified | 0 |
| Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling | Oct 15, 2024 | Instruction FollowingKnowledge Distillation | —Unverified | 0 |
| One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks | Oct 14, 2024 | FairnessGSM8K | CodeCode Available | 0 |
| Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning | Oct 14, 2024 | MathMathematical Reasoning | —Unverified | 0 |
| Innovative Thinking, Infinite Humor: Humor Research of Large Language Models through Structured Thought Leaps | Oct 14, 2024 | Math | —Unverified | 0 |
| CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning | Oct 14, 2024 | MathMathematical Reasoning | CodeCode Available | 1 |
| Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces | Oct 13, 2024 | Computational EfficiencyMath | —Unverified | 0 |
| HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics | Oct 13, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| Expanding Search Space with Diverse Prompting Agents: An Efficient Sampling Approach for LLM Mathematical Reasoning | Oct 13, 2024 | MathMathematical Reasoning | —Unverified | 0 |
| OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models | Oct 12, 2024 | Mathreinforcement-learning | CodeCode Available | 5 |
| Testing GPT-4-o1-preview on math and science problems: A follow-up study | Oct 11, 2024 | MathSpatial Reasoning | —Unverified | 0 |
| Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization | Oct 11, 2024 | GSM8KLanguage Modeling | CodeCode Available | 2 |
| SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights | Oct 11, 2024 | GSM8KMath | CodeCode Available | 4 |
| The Geometry of Concepts: Sparse Autoencoder Feature Structure | Oct 10, 2024 | Math | CodeCode Available | 1 |
| VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models | Oct 10, 2024 | Math | CodeCode Available | 2 |
| Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | Oct 10, 2024 | GSM8KMath | CodeCode Available | 2 |
| Cognitive Noise and Altruistic Preferences | Oct 10, 2024 | Math | —Unverified | 0 |
| Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | Oct 10, 2024 | Arithmetic ReasoningMath | CodeCode Available | 0 |
| MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code | Oct 10, 2024 | MathMathematical Reasoning | CodeCode Available | 2 |
| Herald: A Natural Language Annotated Lean 4 Dataset | Oct 9, 2024 | MathMathematical Reasoning | —Unverified | 0 |
| Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders | Oct 9, 2024 | Math | —Unverified | 0 |