SOTAVerified

Mathematical Problem-Solving

Papers

Showing 1-25 of 106 papers

Title | Status | Hype
EvoAgentX: An Automated Framework for Evolving Agentic Workflows | Code | 7
LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning | Code | 0
TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving | - | 0
SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning | Code | 1
Solving Inequality Proofs with Large Language Models | Code | 1
Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation | Code | 0
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning | Code | 1
PoLAR: Polar-Decomposed Low-Rank Adapter Representation | - | 0
Evaluation of LLMs for mathematical problem solving | - | 0
Decomposing Elements of Problem Solving: What "Math" Does RL Teach? | Code | 0
Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision | Code | 0
Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers | Code | 0
RaDeR: Reasoning-aware Dense Retrieval Models | Code | 1
Can reasoning models comprehend mathematical problems in Chinese ancient texts? An empirical study based on data from Suanjing Shishu | - | 0
SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving | - | 0
Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | - | 0
HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class | Code | 0
Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations | - | 0
Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs | - | 0
PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning | Code | 0
Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving | Code | 2
Reasoning Models Can Be Effective Without Thinking | - | 0
Holistic Capability Preservation: Towards Compact Yet Comprehensive Reasoning Models | - | 0
On Vanishing Variance in Transformer Length Generalization | - | 0
LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models | - | 0

No leaderboard results yet.