SOTAVerified

Mathematical Problem-Solving

Papers

Showing 26-50 of 106 papers

Title | Status | Hype
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning | Code | 1
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets | Code | 1
Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Code | 1
SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning | Code | 1
Training and Evaluating Language Models with Template-based Data Generation | Code | 1
Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models | Code | 1
MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion | Code | 1
Solving Inequality Proofs with Large Language Models | Code | 1
VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models | Code | 1
MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions | Code | 1
MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula | Code | 1
Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities | Code | 1
Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision | Code | 0
Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0
Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation | Code | 0
Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study | Code | 0
Decomposing Elements of Problem Solving: What "Math" Does RL Teach? | Code | 0
PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning | Code | 0
Data Contamination Through the Lens of Time | Code | 0
SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving | Code | 0
Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks | Code | 0
HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class | Code | 0
GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory | Code | 0
Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange | Code | 0
A Survey on Mathematical Reasoning and Optimization with Large Language Models | Code | 0

No leaderboard results yet.