SOTAVerified

Arithmetic Reasoning

Papers

Showing 1–25 of 175 papers

| Title | Status | Hype |
|---|---|---|
| DCR: Quantifying Data Contamination in LLMs Evaluation | Code | 0 |
| DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification | Code | 0 |
| FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design | — | 0 |
| Mathematical Reasoning for Unmanned Aerial Vehicles: A RAG-Based Approach for Complex Arithmetic Reasoning | Code | 0 |
| Learning-at-Criticality in Large Language Models for Quantum Field Theory and Beyond | — | 0 |
| DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Code | 0 |
| VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL | — | 0 |
| Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning | — | 0 |
| Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits | — | 0 |
| HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | Code | 1 |
| OMAC: A Broad Optimization Framework for LLM-Based Multi-Agent Collaboration | Code | 0 |
| Fact-Consistency Evaluation of Text-to-SQL Generation for Business Intelligence Using Exaone 3.5 | — | 0 |
| CAPO: Cost-Aware Prompt Optimization | Code | 2 |
| ThoughtProbe: Classifier-Guided Thought Space Exploration Leveraging LLM Intrinsic Reasoning | — | 0 |
| Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure | Code | 1 |
| Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training | — | 0 |
| The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve? | — | 0 |
| Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning | Code | 1 |
| Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights | — | 0 |
| On Representational Dissociation of Language and Arithmetic in Large Language Models | — | 0 |
| Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding | — | 0 |
| Can LLMs Maintain Fundamental Abilities under KV Cache Compression? | — | 0 |
| CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization | — | 0 |
| SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training | — | 0 |
| Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding | Code | 1 |
Page 1 of 7

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet (HPT) | Accuracy | 97.72 | — | Unverified |
| 2 | DUP prompt upon GPT-4 | Accuracy | 97.1 | — | Unverified |
| 3 | Qwen2-Math-72B-Instruct (greedy) | Accuracy | 96.7 | — | Unverified |
| 4 | SFT-Mistral-7B (MetaMath, OVM, Smart Ensemble) | Accuracy | 96.4 | — | Unverified |
| 5 | OpenMath2-Llama3.1-70B (majority@256) | Accuracy | 96.0 | — | Unverified |
| 6 | Jiutian (large model) | Accuracy | 95.2 | — | Unverified |
| 7 | DAMOMath-7B (MetaMath, OVM, BS, Ensemble) | Accuracy | 95.1 | — | Unverified |
| 8 | Claude 3 Opus (0-shot chain-of-thought) | Accuracy | 95.0 | — | Unverified |
| 9 | OpenMath2-Llama3.1-70B | Accuracy | 94.9 | — | Unverified |
| 10 | GPT-4 (Teaching-Inspired) | Accuracy | 94.8 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Text-davinci-002 (175B) (zero-shot-CoT) | Accuracy | 78.7 | — | Unverified |
| 2 | Text-davinci-002 (175B) (zero-shot) | Accuracy | 17.7 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Tree of Thoughts (b=5) | Success | 0.74 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 (Teaching-Inspired) | Accuracy | 92.2 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 (Teaching-Inspired) | Accuracy | 89.2 | — | Unverified |