SOTAVerified

Mathematical Reasoning

Papers

Showing 351–400 of 805 papers

| Title | Status | Hype |
| --- | --- | --- |
| Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision | | 0 |
| LLMs can implicitly learn from mistakes in-context | | 0 |
| LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought | | 0 |
| LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement | | 0 |
| Enhancing Length Extrapolation in Sequential Models with Pointer-Augmented Neural Memory | | 0 |
| Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics | | 0 |
| math-PVS: A Large Language Model Framework to Map Scientific Publications to PVS Theories | | 0 |
| Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning | | 0 |
| Eliciting Reasoning in Language Models with Cognitive Tools | | 0 |
| Bottlenecked Transformers: Periodic KV Cache Abstraction for Generalised Reasoning | | 0 |
| LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ | | 0 |
| LiteSearch: Efficacious Tree Search for LLM | | 0 |
| Efficient Tool Use with Chain-of-Abstraction Reasoning | | 0 |
| Accurate and Diverse LLM Mathematical Reasoning via Automated PRM-Guided GFlowNets | | 0 |
| MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | | 0 |
| MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs | | 0 |
| Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation | | 0 |
| MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model | | 0 |
| Efficient Long CoT Reasoning in Small Language Models | | 0 |
| LexPam: Legal Procedure Awareness-Guided Mathematical Reasoning | | 0 |
| Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning | | 0 |
| MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams | | 0 |
| MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs | | 0 |
| Let's reward step by step: Step-Level reward model as the Navigators for Reasoning | | 0 |
| Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning | | 0 |
| Let's Reinforce Step by Step | | 0 |
| Let's Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM's Math Capability | | 0 |
| Agent-as-a-Service based on Agent Network | | 0 |
| LemmaHead: RAG Assisted Proof Generation Using Large Language Models | | 0 |
| Dynamic Sampling that Adapts: Iterative DPO for Self-Aware Mathematical Reasoning | | 0 |
| Apriori Knowledge in an Era of Computational Opacity: The Role of AI in Mathematical Discovery | | 0 |
| Efficient Model-agnostic Alignment via Bayesian Persuasion | | 0 |
| Learning to Reason With Relational Abstractions | | 0 |
| Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision | | 0 |
| MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs | | 0 |
| DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models | | 0 |
| BitNet b1.58 2B4T Technical Report | | 0 |
| LLM4DV: Using Large Language Models for Hardware Test Stimuli Generation | | 0 |
| LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems | | 0 |
| LLM Library Learning Fails: A LEGO-Prover Case Study | | 0 |
| Learning to chain-of-thought with Jensen's evidence lower bound | | 0 |
| LLM Reasoning Engine: Specialized Training for Enhanced Mathematical Reasoning | | 0 |
| Dual Instruction Tuning with Large Language Models for Mathematical Reasoning | | 0 |
| LLMs can be easily Confused by Instructional Distractions | | 0 |
| Applying RLAIF for Code Generation with API-usage in Lightweight LLMs | | 0 |
| Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation | | 0 |
| Learning by Applying: A General Framework for Mathematical Reasoning via Enhancing Explicit Knowledge Learning | | 0 |
| DavIR: Data Selection via Implicit Reward for Large Language Models | | 0 |
| DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models | | 0 |
| Mathematical Reasoning in Latent Space | | 0 |

Benchmark Results

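| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | s1-32B | Acc | 56.7 | | Unverified |
| 6 | Search-o1 | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | o1-preview | Accuracy | 0.01 | | Unverified |
| 4 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 5 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |
| 6 | o1-mini | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |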