SOTAVerified

Mathematical Reasoning

Papers

Showing 301–350 of 805 papers

Title | Status | Hype
Large Language Models for Design Structure Matrix Optimization | - | 0
Towards Efficient and Effective Alignment of Large Language Models | - | 0
Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens | - | 0
A Survey on Large Language Models for Mathematical Reasoning | - | 0
Can A Gamer Train A Mathematical Reasoning Model? | Code | 0
VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism | Code | 0
Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic | - | 0
Can Theoretical Physics Research Benefit from Language Agents? | - | 0
Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning | - | 0
Mathematical Reasoning for Unmanned Aerial Vehicles: A RAG-Based Approach for Complex Arithmetic Reasoning | Code | 0
Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning | - | 0
Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models | - | 0
ProRefine: Inference-time Prompt Refinement with Textual Feedback | - | 0
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos | - | 0
LogicPuzzleRL: Cultivating Robust Mathematical Reasoning in LLMs via Reinforcement Learning | Code | 0
Adaptive Graph Pruning for Multi-Agent Communication | Code | 0
WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks | - | 0
Uni-LoRA: One Vector is All You Need | - | 0
GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking | Code | 0
Speculative Reward Model Boosts Decision Making Ability of LLMs Cost-Effectively | Code | 0
Evaluation of LLMs for mathematical problem solving | - | 0
RMoA: Optimizing Mixture-of-Agents through Diversity Maximization and Residual Compensation | Code | 0
Scaling up the think-aloud method | Code | 0
Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt | - | 0
Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness | - | 0
Diversity-Aware Policy Optimization for Large Language Model Reasoning | - | 0
Discriminative Policy Optimization for Token-Level Reward Models | Code | 0
AutoGPS: Automated Geometry Problem Solving via Multimodal Formalization and Deductive Reasoning | - | 0
On-Policy RL with Optimal Reward Baseline | - | 0
Let's Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM's Math Capability | - | 0
Probability-Consistent Preference Optimization for Enhanced LLM Reasoning | Code | 0
Decomposing Elements of Problem Solving: What "Math" Does RL Teach? | Code | 0
Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models | - | 0
Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision | Code | 0
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles | - | 0
Improving Multilingual Math Reasoning for African Languages | - | 0
HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation | - | 0
SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking | Code | 0
AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models | - | 0
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment | - | 0
MMATH: A Multilingual Benchmark for Mathematical Reasoning | Code | 0
Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions | Code | 0
Efficient Long CoT Reasoning in Small Language Models | - | 0
LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges | Code | 0
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation | - | 0
Unraveling Misinformation Propagation in LLM Reasoning | Code | 0
PPT: A Process-based Preference Learning Framework for Self Improving Table Question Answering Models | - | 0
Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence | - | 0
The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs | - | 0
MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models | - | 0
Page 7 of 17

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Acc | 94.4 | - | Unverified
2 | DeepSeek-r1 | Acc | 79.8 | - | Unverified
3 | Openai-o1 | Acc | 74.4 | - | Unverified
4 | Openai-o1-mini | Acc | 70 | - | Unverified
5 | s1-32B | Acc | 56.7 | - | Unverified
6 | Search-o1 | Acc | 56.7 | - | Unverified
7 | Openai-o1-preview | Acc | 44.6 | - | Unverified
8 | Qwen2.5-72B-Instruct | Acc | 23.3 | - | Unverified
9 | Claude3.5-Sonnet | Acc | 16 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | o3 | Accuracy | 0.25 | - | Unverified
2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | - | Unverified
3 | o1-preview | Accuracy | 0.01 | - | Unverified
4 | GPT-4o | Accuracy | 0.01 | - | Unverified
5 | Claude 3.5 Sonnet | Accuracy | 0.01 | - | Unverified
6 | o1-mini | Accuracy | 0.01 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | - | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | - | Unverified
3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | - | Unverified
4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | - | Unverified
5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | - | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | - | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | - | Unverified
3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | - | Unverified
4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | - | Unverified
5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | - | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Completion accuracy | 65.8 | - | Unverified
2 | PGPSNet | Completion accuracy | 62.7 | - | Unverified
3 | GAPS | Completion accuracy | 61.2 | - | Unverified
4 | Inter-GPS | Completion accuracy | 59.8 | - | Unverified
5 | Geoformer | Completion accuracy | 35.6 | - | Unverified
6 | NGS | Completion accuracy | 34.1 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | QWQ-32B-preview | Acc | 82.5 | - | Unverified
2 | Math-Master | Acc | 82 | - | Unverified
3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 75.2 | - | Unverified
2 | GAPS | Accuracy (%) | 67.8 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Search-o1 | Acc | 86.4 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 98.5 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GAPS | Accuracy (%) | 97.5 | - | Unverified