SOTAVerified

Mathematical Reasoning

Papers

Showing 651–700 of 805 papers

| Title | Status | Hype |
| --- | --- | --- |
| Applying RLAIF for Code Generation with API-usage in Lightweight LLMs | | 0 |
| Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts | | 0 |
| Anomaly Detection of Tabular Data Using LLMs | | 0 |
| Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads | | 0 |
| Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models | Code | 0 |
| CodeGemma: Open Code Models Based on Gemma | | 0 |
| Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | | 0 |
| MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models | | 0 |
| ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models | | 0 |
| Robustness Assessment of Mathematical Reasoning in the Presence of Missing and Contradictory Conditions | | 0 |
| LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Code | 0 |
| Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models | | 0 |
| Improve Mathematical Reasoning in Language Models by Automated Process Supervision | | 0 |
| Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models | | 0 |
| NUMCoT: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models | Code | 0 |
| Pre-trained Large Language Models Use Fourier Features to Compute Addition | | 0 |
| IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models | | 0 |
| Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data | | 0 |
| Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction | Code | 0 |
| Efficient Model-agnostic Alignment via Bayesian Persuasion | | 0 |
| Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications | | 0 |
| DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data | | 0 |
| Can LLMs Solve longer Math Word Problems Better? | Code | 0 |
| DOP: Diagnostic-Oriented Prompting for Large Language Models in Mathematical Correction | Code | 0 |
| A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks | | 0 |
| MathDivide: Improved mathematical reasoning by large language models | | 0 |
| Aligning Tutor Discourse Supporting Rigorous Thinking with Tutee Content Mastery for Predicting Math Achievement | | 0 |
| LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought | | 0 |
| A Careful Examination of Large Language Model Performance on Grade School Arithmetic | | 0 |
| Exploring the Limits of Fine-grained LLM-based Physics Inference via Premise Removal Interventions | | 0 |
| PARAMANU-GANITA: Language Model with Mathematical Capabilities | | 0 |
| Pre-Calc: Learning to Use the Calculator Improves Numeracy in Language Models | Code | 0 |
| Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training | | 0 |
| iTBLS: A Dataset of Interactive Conversations Over Tabular Information | | 0 |
| Enhancing Length Extrapolation in Sequential Models with Pointer-Augmented Neural Memory | | 0 |
| Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Code | 0 |
| Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition | Code | 0 |
| SAAS: Solving Ability Amplification Strategy for Enhanced Mathematical Reasoning in Large Language Models | | 0 |
| Exploring the Mystery of Influential Data for Mathematical Reasoning | | 0 |
| Planning and Editing What You Retrieve for Enhanced Tool Learning | Code | 0 |
| Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange | Code | 0 |
| Dual Instruction Tuning with Large Language Models for Mathematical Reasoning | | 0 |
| MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | | 0 |
| Reinforcement Learning from Reflective Feedback (RLRF): Aligning and Improving LLMs via Fine-Grained Self-Reflection | | 0 |
| Instructing Large Language Models to Identify and Ignore Irrelevant Conditions | Code | 0 |
| OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety | | 0 |
| Apriori Knowledge in an Era of Computational Opacity: The Role of AI in Mathematical Discovery | | 0 |
| FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models | | 0 |
| Prompt Selection and Augmentation for Few Examples Code Generation in Large Language Model and its Application in Robotics Control | | 0 |
| Machine learning and information theory concepts towards an AI Mathematician | | 0 |

Benchmark Results

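| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | s1-32B | Acc | 56.7 | | Unverified |
| 6 | Search-o1 | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | o1-preview | Accuracy | 0.01 | | Unverified |
| 4 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 5 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |
| 6 | o1-mini | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |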