SOTAVerified

Mathematical Reasoning

Papers

Showing 551-600 of 805 papers

Title | Status | Hype
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models | | 0
Multi-tool Integration Application for Math Reasoning Using Large Language Model | | 0
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts | | 0
MWPRanker: An Expression Similarity Based Math Word Problem Retriever | | 0
Neuro-Symbolic Data Generation for Math Reasoning | | 0
Noisy Deductive Reasoning: How Humans Construct Math, and How Math Constructs Universes | | 0
None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering | | 0
Notes on a Path to AI Assistance in Mathematical Reasoning | | 0
No Train Still Gain. Unleash Mathematical Reasoning of Large Language Models with Monte Carlo Tree Search Guided by Energy Function | | 0
Novice Learner and Expert Tutor: Evaluating Math Reasoning Abilities of Large Language Models with Misconceptions | | 0
NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks | | 0
Olapa-MCoT: Enhancing the Chinese Mathematical Reasoning Capability of LLMs | | 0
One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs | | 0
On-Policy RL with Optimal Reward Baseline | | 0
On the meaning of uncertainty for ethical AI: philosophy and practice | | 0
OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety | | 0
Optimizing Alignment with Less: Leveraging Data Augmentation for Personalized Evaluation | | 0
Optimizing Numerical Estimation and Operational Efficiency in the Legal Domain through Large Language Models | | 0
Orca 2: Teaching Small Language Models How to Reason | | 0
Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning | | 0
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 0
Random Feedback Alignment Algorithms to train Neural Networks: Why do they Align? | | 0
Real-Time Verification of Embodied Reasoning for Generative Skill Acquisition | | 0
ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning | | 0
Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment | | 0
Reasoning in Conversation: Solving Subjective Tasks through Dialogue Simulation for Large Language Models | | 0
MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models | | 0
Recognizing and Verifying Mathematical Equations using Multiplicative Differential Neural Units | | 0
Reinforcement Learning from Reflective Feedback (RLRF): Aligning and Improving LLMs via Fine-Grained Self-Reflection | | 0
Reliable and Efficient Amortized Model-based Evaluation | | 0
Reliable Natural Language Understanding with Large Language Models and Answer Set Programming | | 0
Reliable Reasoning Beyond Natural Language | | 0
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs | | 0
Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning | | 0
Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot | | 0
Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness | | 0
Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt | | 0
Revisiting Self-Consistency from Dynamic Distributional Alignment Perspective on Answer Aggregation | | 0
Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning | | 0
Revisiting the Superficial Alignment Hypothesis | | 0
RL-finetuning LLMs from on- and off-policy data with a single algorithm | | 0
Robustness Assessment of Mathematical Reasoning in the Presence of Missing and Contradictory Conditions | | 0
RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library | | 0
S^3c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners | | 0
SAAS: Solving Ability Amplification Strategy for Enhanced Mathematical Reasoning in Large Language Models | | 0
Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking | | 0
Sample, Don't Search: Rethinking Test-Time Alignment for Language Models | | 0
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search | | 0
SAT Solvers and Computer Algebra Systems: A Powerful Combination for Mathematics | | 0
SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Acc | 94.4 | | Unverified
2 | DeepSeek-r1 | Acc | 79.8 | | Unverified
3 | Openai-o1 | Acc | 74.4 | | Unverified
4 | Openai-o1-mini | Acc | 70 | | Unverified
5 | Search-o1 | Acc | 56.7 | | Unverified
6 | s1-32B | Acc | 56.7 | | Unverified
7 | Openai-o1-preview | Acc | 44.6 | | Unverified
8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified
9 | Claude3.5-Sonnet | Acc | 16 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | o3 | Accuracy | 0.25 | | Unverified
2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified
3 | GPT-4o | Accuracy | 0.01 | | Unverified
4 | o1-mini | Accuracy | 0.01 | | Unverified
5 | o1-preview | Accuracy | 0.01 | | Unverified
6 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified
3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified
4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified
2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified
3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified
4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified
5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified
6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Completion accuracy | 65.8 | | Unverified
2 | PGPSNet | Completion accuracy | 62.7 | | Unverified
3 | GAPS | Completion accuracy | 61.2 | | Unverified
4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified
5 | Geoformer | Completion accuracy | 35.6 | | Unverified
6 | NGS | Completion accuracy | 34.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | QWQ-32B-preview | Acc | 82.5 | | Unverified
2 | Math-Master | Acc | 82 | | Unverified
3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 75.2 | | Unverified
2 | GAPS | Accuracy (%) | 67.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Search-o1 | Acc | 86.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GOLD | Accuracy (%) | 98.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GAPS | Accuracy (%) | 97.5 | | Unverified