SOTAVerified

Mathematical Problem-Solving

Papers

Showing 26–50 of 106 papers

| Title | Status | Hype |
|---|---|---|
| Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models | Code | 1 |
| Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs | Code | 1 |
| VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models | Code | 1 |
| Training and Evaluating Language Models with Template-based Data Generation | Code | 1 |
| Non-myopic Generation of Language Models for Reasoning and Planning | Code | 1 |
| BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search | Code | 1 |
| MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula | Code | 1 |
| MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions | Code | 1 |
| Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks | Code | 1 |
| Evaluating Language Models for Mathematics through Interactions | Code | 1 |
| A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets | Code | 1 |
| Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers | Code | 1 |
| LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning | Code | 0 |
| TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving | | 0 |
| Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation | Code | 0 |
| PoLAR: Polar-Decomposed Low-Rank Adapter Representation | | 0 |
| Evaluation of LLMs for mathematical problem solving | | 0 |
| Decomposing Elements of Problem Solving: What "Math" Does RL Teach? | Code | 0 |
| Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision | Code | 0 |
| Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers | Code | 0 |
| Can reasoning models comprehend mathematical problems in Chinese ancient texts? An empirical study based on data from Suanjing Shishu | | 0 |
| SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving | | 0 |
| Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | | 0 |
| HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class | Code | 0 |
| Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs | | 0 |

No leaderboard results yet.