SOTAVerified

Mathematical Reasoning

Papers

Showing 351–400 of 805 papers

| Title | Status | Hype |
| --- | --- | --- |
| KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference | Code | 0 |
| Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence | Code | 0 |
| Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement | Code | 0 |
| On-Policy RL with Optimal Reward Baseline | Code | 0 |
| Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs | Code | 0 |
| Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning | Code | 0 |
| NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors | Code | 0 |
| NUMCoT: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models | Code | 0 |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | Code | 0 |
| Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0 |
| Decomposing Elements of Problem Solving: What "Math" Does RL Teach? | Code | 0 |
| Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation | Code | 0 |
| Instructing Large Language Models to Identify and Ignore Irrelevant Conditions | Code | 0 |
| Multi-Agent Sampling: Scaling Inference Compute for Data Synthesis with Tree Search-Based Agentic Collaboration | Code | 0 |
| MMATH: A Multilingual Benchmark for Mathematical Reasoning | Code | 0 |
| An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning | Code | 0 |
| MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | Code | 0 |
| MoD: A Distribution-Based Approach for Merging Large Language Models | Code | 0 |
| MultiLingPoT: Enhancing Mathematical Reasoning with Multilingual Program Fine-tuning | Code | 0 |
| Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying | Code | 0 |
| MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree | Code | 0 |
| Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning | Code | 0 |
| Compositional Generalization with Tree Stack Memory Units | Code | 0 |
| How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective | Code | 0 |
| Math Word Problem Solving by Generating Linguistic Variants of Problem Statements | Code | 0 |
| How Do Humans Write Code? Large Models Do It the Same Way Too | Code | 0 |
| Analysing Mathematical Reasoning Abilities of Neural Models | Code | 0 |
| MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning | Code | 0 |
| MathScale: Scaling Instruction Tuning for Mathematical Reasoning | Code | 0 |
| MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark | Code | 0 |
| MCC-KD: Multi-CoT Consistent Knowledge Distillation | Code | 0 |
| Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning | Code | 0 |
| Multilingual Mathematical Reasoning: Advancing Open-Source LLMs in Hindi and English | Code | 0 |
| Hierarchical Attention Generates Better Proofs | Code | 0 |
| Adaptive Graph Pruning for Multi-Agent Communication | Code | 0 |
| HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class | Code | 0 |
| Guided Stream of Search: Learning to Better Search with Language Models via Optimal Path Guidance | Code | 0 |
| Mathematical Reasoning for Unmanned Aerial Vehicles: A RAG-Based Approach for Complex Arithmetic Reasoning | Code | 0 |
| Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges | Code | 0 |
| GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking | Code | 0 |
| Compositional Processing Emerges in Neural Networks Solving Math Problems | Code | 0 |
| ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention | Code | 0 |
| Give me a hint: Can LLMs take a hint to solve math problems? | Code | 0 |
| CoinMath: Harnessing the Power of Coding Instruction for Math LLMs | Code | 0 |
| ATHENA: Mathematical Reasoning with Thought Expansion | Code | 0 |
| Code Soliloquies for Accurate Calculations in Large Language Models | Code | 0 |
| MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty | Code | 0 |
| MARGE: Improving Math Reasoning for LLMs with Guided Exploration | Code | 0 |
| Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts? | Code | 0 |
| Gap-Filling Prompting Enhances Code-Assisted Mathematical Reasoning | Code | 0 |

Benchmark Results

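| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | Search-o1 | Acc | 56.7 | | Unverified |
| 6 | s1-32B | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 4 | o1-mini | Accuracy | 0.01 | | Unverified |
| 5 | o1-preview | Accuracy | 0.01 | | Unverified |
| 6 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |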