SOTAVerified

Mathematical Reasoning

Papers

Showing 401-450 of 805 papers

| Title | Status | Hype |
| --- | --- | --- |
| WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks | | 0 |
| What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning | | 0 |
| Why are NLP Models Fumbling at Elementary Math? A Survey of Automatic Word Problem Solvers | | 0 |
| Why are NLP Models Fumbling at Elementary Math? A Survey of Deep Learning based Word Problem Solvers | | 0 |
| WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications | | 0 |
| 1bit-Merging: Dynamic Quantized Merging for Large Language Models | | 0 |
| You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism | | 0 |
| MathLearner: A Large Language Model Agent Framework for Learning to Solve Mathematical Problems | | 0 |
| AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum | | 0 |
| A Careful Examination of Large Language Model Performance on Grade School Arithmetic | | 0 |
| Accurate and Diverse LLM Mathematical Reasoning via Automated PRM-Guided GFlowNets | | 0 |
| A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting | | 0 |
| ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment | | 0 |
| AdapThink: Adaptive Thinking Preferences for Reasoning Language Model | | 0 |
| AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning | | 0 |
| Advancing Mathematical Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages | | 0 |
| Adventures in Mathematical Reasoning | | 0 |
| Agent-as-a-Service based on Agent Network | | 0 |
| Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning | | 0 |
| A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions | | 0 |
| AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models | | 0 |
| Aligning Tutor Discourse Supporting Rigorous Thinking with Tutee Content Mastery for Predicting Math Achievement | | 0 |
| Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN | | 0 |
| Anomaly Detection of Tabular Data Using LLMs | | 0 |
| Applications of Positive Unlabeled (PU) and Negative Unlabeled (NU) Learning in Cybersecurity | | 0 |
| Applying RLAIF for Code Generation with API-usage in Lightweight LLMs | | 0 |
| Apriori Knowledge in an Era of Computational Opacity: The Role of AI in Mathematical Discovery | | 0 |
| Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations? | | 0 |
| Assessing GPT4-V on Structured Reasoning Tasks | | 0 |
| Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering | | 0 |
| Assessing Robustness to Spurious Correlations in Post-Training Language Models | | 0 |
| Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models | | 0 |
| Assessing the Impact of Prompting Methods on ChatGPT's Mathematical Capabilities | | 0 |
| A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges | | 0 |
| A Survey on Large Language Models for Mathematical Reasoning | | 0 |
| A Symbolic Framework for Evaluating Mathematical Reasoning and Generalisation with Transformers | | 0 |
| A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks | | 0 |
| A Systematic Survey on Large Language Models for Algorithm Design | | 0 |
| A Technical Study into Small Reasoning Language Models | | 0 |
| Augmenting In-Context-Learning in LLMs via Automatic Data Labeling and Refinement | | 0 |
| AutoGeo: Automating Geometric Image Dataset Creation for Enhanced Geometry Understanding | | 0 |
| AutoGPS: Automated Geometry Problem Solving via Multimodal Formalization and Deductive Reasoning | | 0 |
| AutoMathKG: The automated mathematical knowledge graph based on LLM and vector database | | 0 |
| Forward-Backward Reasoning in Large Language Models for Mathematical Verification | | 0 |
| Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications | | 0 |
| Benchmarking Large Language Models via Random Variables | | 0 |
| Benchmarking Large Language Models with Integer Sequence Generation Tasks | | 0 |
| Better Process Supervision with Bi-directional Rewarding Signals | | 0 |
| Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning | | 0 |
| Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning | | 0 |

Benchmark Results

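| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | s1-32B | Acc | 56.7 | | Unverified |
| 6 | Search-o1 | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | o1-preview | Accuracy | 0.01 | | Unverified |
| 4 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 5 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |
| 6 | o1-mini | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |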