SOTAVerified

Mathematical Reasoning

Papers

Showing 451-500 of 805 papers

| Title | Status | Hype |
|---|---|---|
| Guided Stream of Search: Learning to Better Search with Language Models via Optimal Path Guidance | Code | 0 |
| LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning | Code | 5 |
| GraphIC: A Graph-Based In-Context Example Retrieval Model for Multi-Step Reasoning | | 0 |
| CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning | | 0 |
| Evaluating Robustness of Reward Models for Mathematical Reasoning | | 0 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | Code | 4 |
| Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models | | 0 |
| Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems | Code | 0 |
| INC-Math: Integrating Natural Language and Code for Enhanced Mathematical Reasoning in Large Language Models | | 0 |
| Revisiting the Superficial Alignment Hypothesis | | 0 |
| HM3: Hierarchical Multi-Objective Model Merging for Pretrained Models | | 0 |
| Evaluation of OpenAI o1: Opportunities and Challenges of AGI | | 0 |
| PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization | Code | 1 |
| LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ | | 0 |
| ControlMath: Controllable Data Generation Promotes Math Generalist Models | | 0 |
| Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning | Code | 1 |
| InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning | | 0 |
| Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement | | 0 |
| RoMath: A Mathematical Reasoning Benchmark in Romanian | Code | 0 |
| Causal Inference with Large Language Model: A Survey | | 0 |
| CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks | | 0 |
| Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding | | 0 |
| MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model | | 0 |
| Mathematical Formalized Problem Solving and Theorem Proving in Different Fields in Lean 4 | Code | 0 |
| Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver | Code | 1 |
| From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks | | 0 |
| CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Code | 2 |
| Building Math Agents with Multi-Turn Iterative Preference Learning | | 0 |
| S^3c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners | | 0 |
| MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | Code | 1 |
| Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems | | 0 |
| AutoGeo: Automating Geometric Image Dataset Creation for Enhanced Geometry Understanding | | 0 |
| SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models | | 0 |
| Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation | | 0 |
| Path-Consistency: Prefix Enhancement for Efficient Inference in LLM | | 0 |
| Tangram: Benchmark for Evaluating Geometric Element Recognition in Large Multimodal Models | | 0 |
| Multi-tool Integration Application for Math Reasoning Using Large Language Model | | 0 |
| SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding | | 0 |
| Taming Generative Diffusion Prior for Universal Blind Image Restoration | | 0 |
| Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0 |
| Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting | | 0 |
| Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning | Code | 1 |
| MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark | Code | 0 |
| MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty | Code | 0 |
| Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement | Code | 1 |
| MathLearner: A Large Language Model Agent Framework for Learning to Solve Mathematical Problems | | 0 |
| AI-Assisted Generation of Difficult Math Questions | Code | 0 |
| Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process | Code | 2 |
| SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages | Code | 2 |
| Optimizing Numerical Estimation and Operational Efficiency in the Legal Domain through Large Language Models | | 0 |

Benchmark Results

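| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | Search-o1 | Acc | 56.7 | | Unverified |
| 6 | s1-32B | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 4 | o1-mini | Accuracy | 0.01 | | Unverified |
| 5 | o1-preview | Accuracy | 0.01 | | Unverified |
| 6 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |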