SOTAVerified

Mathematical Reasoning

Papers

Showing 301–350 of 805 papers

Title | Status | Hype
Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions | Code | 0
Pre-Calc: Learning to Use the Calculator Improves Numeracy in Language Models | Code | 0
Reasoning with Transformer-based Models: Deep Learning, but Shallow Reasoning | Code | 0
RoMath: A Mathematical Reasoning Benchmark in Romanian | Code | 0
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models | Code | 0
Bridging the Reasoning Gap: Small LLMs Can Plan with Generalised Strategies | Code | 0
Agentic-R1: Distilled Dual-Strategy Reasoning | Code | 0
PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment | Code | 0
Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes | Code | 0
SWI: Speaking with Intent in Large Language Models | Code | 0
Position: AI Evaluation Should Learn from How We Test Humans | Code | 0
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark | Code | 0
Planning and Editing What You Retrieve for Enhanced Tool Learning | Code | 0
OmniRouter: Budget and Performance Controllable Multi-LLM Routing | Code | 0
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement | Code | 0
Blank Collapse: Compressing CTC emission for the faster decoding | Code | 0
Reasoning over Uncertain Text by Generative Large Language Models | Code | 0
Overcoming Barriers to Skill Injection in Language Modeling: Case Study in Arithmetic | Code | 0
Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think | Code | 0
DOP: Diagnostic-Oriented Prompting for Large Language Models in Mathematical Correction | Code | 0
Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Code | 0
Probability-Consistent Preference Optimization for Enhanced LLM Reasoning | Code | 0
Scaling Reasoning can Improve Factuality in Large Language Models | Code | 0
Do LLM Evaluators Prefer Themselves for a Reason? | Code | 0
Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs | Code | 0
Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning | Code | 0
Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS | Code | 0
NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors | Code | 0
Discriminative Policy Optimization for Token-Level Reward Models | Code | 0
Multilingual Mathematical Reasoning: Advancing Open-Source LLMs in Hindi and English | Code | 0
Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning | Code | 0
Multi-Agent Sampling: Scaling Inference Compute for Data Synthesis with Tree Search-Based Agentic Collaboration | Code | 0
KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference | Code | 0
Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence | Code | 0
MultiLingPoT: Enhancing Mathematical Reasoning with Multilingual Program Fine-tuning | Code | 0
On-Policy RL with Optimal Reward Baseline | Code | 0
MoD: A Distribution-Based Approach for Merging Large Language Models | Code | 0
MMATH: A Multilingual Benchmark for Mathematical Reasoning | Code | 0
Compositional Generalization with Tree Stack Memory Units | Code | 0
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | Code | 0
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models | Code | 0
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | Code | 0
Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0
Decomposing Elements of Problem Solving: What "Math" Does RL Teach? | Code | 0
MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning | Code | 0
NUMCoT: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models | Code | 0
Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation | Code | 0
Large Language Models for Mathematical Analysis | Code | 0
Instructing Large Language Models to Identify and Ignore Irrelevant Conditions | Code | 0
Math Word Problem Solving by Generating Linguistic Variants of Problem Statements | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Acc | 94.4 | | Unverified
2 | DeepSeek-r1 | Acc | 79.8 | | Unverified
3 | Openai-o1 | Acc | 74.4 | | Unverified
4 | Openai-o1-mini | Acc | 70 | | Unverified
5 | s1-32B | Acc | 56.7 | | Unverified
6 | Search-o1 | Acc | 56.7 | | Unverified
7 | Openai-o1-preview | Acc | 44.6 | | Unverified
8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified
9 | Claude3.5-Sonnet | Acc | 16 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | o3 | Accuracy | 0.25 | | Unverified
2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified
3 | o1-preview | Accuracy | 0.01 | | Unverified
4 | GPT-4o | Accuracy | 0.01 | | Unverified
5 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified
6 | o1-mini | Accuracy | 0.01 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified
3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified
4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified
3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified
5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Completion accuracy | 65.8 | | Unverified
2 | PGPSNet | Completion accuracy | 62.7 | | Unverified
3 | GAPS | Completion accuracy | 61.2 | | Unverified
4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified
5 | Geoformer | Completion accuracy | 35.6 | | Unverified
6 | NGS | Completion accuracy | 34.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | QWQ-32B-preview | Acc | 82.5 | | Unverified
2 | Math-Master | Acc | 82 | | Unverified
3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 75.2 | | Unverified
2 | GAPS | Accuracy (%) | 67.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Search-o1 | Acc | 86.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 98.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GAPS | Accuracy (%) | 97.5 | | Unverified