SOTAVerified

Mathematical Reasoning

Papers

Showing 101–150 of 805 papers

Title | Status | Hype
An Expression Tree Decoding Strategy for Mathematical Equation Generation | Code | 2
SOLO: A Single Transformer for Scalable Vision-Language Modeling | Code | 2
Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving | Code | 2
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization | Code | 2
Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs | Code | 2
Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models? | Code | 2
Scaling Language Models: Methods, Analysis & Insights from Training Gopher | Code | 2
Benchmarking Benchmark Leakage in Large Language Models | Code | 2
MathPile: A Billion-Token-Scale Pretraining Corpus for Math | Code | 2
LoRA: Low-Rank Adaptation of Large Language Models | Code | 2
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning | Code | 2
Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts | Code | 2
ProcessBench: Identifying Process Errors in Mathematical Reasoning | Code | 2
CoRT: Code-integrated Reasoning within Thinking | Code | 2
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | Code | 2
Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples | Code | 2
A Survey of Deep Learning for Mathematical Reasoning | Code | 2
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models | Code | 2
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process | Code | 2
Preference Optimization for Reasoning with Pseudo Feedback | Code | 2
Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning | Code | 2
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | Code | 2
Offline Reinforcement Learning for LLM Multi-Step Reasoning | Code | 2
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | Code | 2
AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning | Code | 2
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate | Code | 2
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning | Code | 2
Efficient Reinforcement Finetuning via Adaptive Curriculum Learning | Code | 2
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks | Code | 2
Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement | Code | 2
LeanAgent: Lifelong Learning for Formal Theorem Proving | Code | 2
CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Code | 2
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios | Code | 2
MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning | Code | 2
Multi-View Reasoning: Consistent Contrastive Learning for Math Word Problem | Code | 2
Optimizing Anytime Reasoning via Budget Relative Policy Optimization | Code | 2
Reformatted Alignment | Code | 2
Compression Represents Intelligence Linearly | Code | 2
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning | Code | 2
Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning | Code | 2
Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning | Code | 2
Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? | Code | 1
CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | Code | 1
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | Code | 1
DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models | Code | 1
Ada-Instruct: Adapting Instruction Generators for Complex Reasoning | Code | 1
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning | Code | 1
MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion | Code | 1
Mathematical Capabilities of ChatGPT | Code | 1
Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models | Code | 1
Page 3 of 17

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Acc | 94.4 | | Unverified
2 | DeepSeek-r1 | Acc | 79.8 | | Unverified
3 | Openai-o1 | Acc | 74.4 | | Unverified
4 | Openai-o1-mini | Acc | 70 | | Unverified
5 | Search-o1 | Acc | 56.7 | | Unverified
6 | s1-32B | Acc | 56.7 | | Unverified
7 | Openai-o1-preview | Acc | 44.6 | | Unverified
8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified
9 | Claude3.5-Sonnet | Acc | 16 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | o3 | Accuracy | 0.25 | | Unverified
2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified
3 | GPT-4o | Accuracy | 0.01 | | Unverified
4 | o1-mini | Accuracy | 0.01 | | Unverified
5 | o1-preview | Accuracy | 0.01 | | Unverified
6 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified
3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified
4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified
3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified
5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Completion accuracy | 65.8 | | Unverified
2 | PGPSNet | Completion accuracy | 62.7 | | Unverified
3 | GAPS | Completion accuracy | 61.2 | | Unverified
4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified
5 | Geoformer | Completion accuracy | 35.6 | | Unverified
6 | NGS | Completion accuracy | 34.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | QWQ-32B-preview | Acc | 82.5 | | Unverified
2 | Math-Master | Acc | 82 | | Unverified
3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 75.2 | | Unverified
2 | GAPS | Accuracy (%) | 67.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Search-o1 | Acc | 86.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 98.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GAPS | Accuracy (%) | 97.5 | | Unverified