SOTAVerified

Mathematical Reasoning

Papers

Showing 601-650 of 805 papers

| Title | Status | Hype |
| --- | --- | --- |
| MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | | 0 |
| Reinforcement Learning from Reflective Feedback (RLRF): Aligning and Improving LLMs via Fine-Grained Self-Reflection | | 0 |
| Instructing Large Language Models to Identify and Ignore Irrelevant Conditions | Code | 0 |
| OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety | | 0 |
| Apriori Knowledge in an Era of Computational Opacity: The Role of AI in Mathematical Discovery | | 0 |
| FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models | | 0 |
| Prompt Selection and Augmentation for Few Examples Code Generation in Large Language Model and its Application in Robotics Control | | 0 |
| RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation | Code | 3 |
| Machine learning and information theory concepts towards an AI Mathematician | | 0 |
| MathScale: Scaling Instruction Tuning for Mathematical Reasoning | Code | 0 |
| Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models | Code | 1 |
| Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning | | 0 |
| You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism | | 0 |
| Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models | | 0 |
| GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers | Code | 2 |
| Reasoning in Conversation: Solving Subjective Tasks through Dialogue Simulation for Large Language Models | | 0 |
| MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning | Code | 0 |
| MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs | | 0 |
| Stepwise Self-Consistent Mathematical Reasoning with Large Language Models | Code | 1 |
| How Do Humans Write Code? Large Models Do It the Same Way Too | Code | 0 |
| Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models | | 0 |
| Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes | Code | 0 |
| Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | Code | 2 |
| ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models | Code | 1 |
| Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models | | 0 |
| Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models | Code | 1 |
| Reformatted Alignment | Code | 2 |
| Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents | Code | 1 |
| Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement | Code | 0 |
| Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering | | 0 |
| When is Tree Search Useful for LLM Planning? It Depends on the Discriminator | Code | 2 |
| Reasoning over Uncertain Text by Generative Large Language Models | Code | 0 |
| MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data | Code | 1 |
| Fourier Circuits in Neural Networks and Transformers: A Case Study of Modular Arithmetic with Multiple Inputs | | 0 |
| Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts | Code | 2 |
| Can Graph Descriptive Order Affect Solving Graph Problems with LLMs? | | 0 |
| Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large Language Models | | 0 |
| DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | Code | 9 |
| Large Language Models for Mathematical Reasoning: Progresses and Challenges | | 0 |
| Large Multi-Modal Models (LMMs) as Universal Foundation Models for AI-Native Wireless Systems | | 0 |
| Efficient Tool Use with Chain-of-Abstraction Reasoning | | 0 |
| GAPS: Geometry-Aware Problem Solver | | 0 |
| EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty | Code | 7 |
| Demystifying Chains, Trees, and Graphs of Thoughts | | 0 |
| Distilling Mathematical Reasoning Capabilities into Small Language Models | | 0 |
| SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in Chinese | Code | 2 |
| LangBridge: Multilingual Reasoning Without Multilingual Supervision | Code | 2 |
| Knowledge Fusion of Large Language Models | Code | 4 |
| Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions | Code | 1 |
| Augmenting Math Word Problems via Iterative Question Composing | Code | 1 |

Benchmark Results

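| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | s1-32B | Acc | 56.7 | | Unverified |
| 6 | Search-o1 | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | o1-preview | Accuracy | 0.01 | | Unverified |
| 4 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 5 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |
| 6 | o1-mini | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |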