SOTAVerified

Arithmetic Reasoning

Papers

Showing 1–50 of 175 papers

| Title | Status | Hype |
|---|---|---|
| DCR: Quantifying Data Contamination in LLMs Evaluation | Code | 0 |
| DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification | Code | 0 |
| FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design | — | 0 |
| Mathematical Reasoning for Unmanned Aerial Vehicles: A RAG-Based Approach for Complex Arithmetic Reasoning | Code | 0 |
| Learning-at-Criticality in Large Language Models for Quantum Field Theory and Beyond | — | 0 |
| DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Code | 0 |
| VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL | — | 0 |
| Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning | — | 0 |
| Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits | — | 0 |
| HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | Code | 1 |
| OMAC: A Broad Optimization Framework for LLM-Based Multi-Agent Collaboration | Code | 0 |
| Fact-Consistency Evaluation of Text-to-SQL Generation for Business Intelligence Using Exaone 3.5 | — | 0 |
| CAPO: Cost-Aware Prompt Optimization | Code | 2 |
| ThoughtProbe: Classifier-Guided Thought Space Exploration Leveraging LLM Intrinsic Reasoning | — | 0 |
| Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure | Code | 1 |
| Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training | — | 0 |
| The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve? | — | 0 |
| Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning | Code | 1 |
| Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights | — | 0 |
| Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding | — | 0 |
| On Representational Dissociation of Language and Arithmetic in Large Language Models | — | 0 |
| Can LLMs Maintain Fundamental Abilities under KV Cache Compression? | — | 0 |
| CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization | — | 0 |
| SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training | — | 0 |
| Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding | Code | 1 |
| DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models | — | 0 |
| Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning | — | 0 |
| Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs | — | 0 |
| Hint Marginalization for Improved Reasoning in Large Language Models | — | 0 |
| GaLore+: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection | — | 0 |
| S^2FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity | — | 0 |
| Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning | — | 0 |
| PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model | — | 0 |
| Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning | Code | 0 |
| Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics | Code | 1 |
| FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models | Code | 1 |
| Language Imbalance Driven Rewarding for Multilingual Self-improving | Code | 1 |
| Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | Code | 0 |
| Think Beyond Size: Adaptive Prompting for More Effective Reasoning | — | 0 |
| Unlocking Structured Thinking in Language Models with Cognitive Prompting | — | 0 |
| OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | Code | 4 |
| Small Language Models are Equation Reasoners | — | 0 |
| 3-in-1: 2D Rotary Adaptation for Efficient Finetuning, Efficient Batching and Composability | Code | 0 |
| Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning Tasks | — | 0 |
| Leveraging LLM Reasoning Enhances Personalized Recommender Systems | — | 0 |
| Toward Adaptive Reasoning in Large Language Models with Thought Rollback | Code | 1 |
| Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together | — | 0 |
| Qwen2 Technical Report | Code | 13 |
| Self-training Language Models for Arithmetic Reasoning | Code | 0 |
| SBoRA: Low-Rank Adaptation with Regional Weight Updates | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet (HPT) | Accuracy | 97.72 | — | Unverified |
| 2 | DUP prompt upon GPT-4 | Accuracy | 97.1 | — | Unverified |
| 3 | Qwen2-Math-72B-Instruct (greedy) | Accuracy | 96.7 | — | Unverified |
| 4 | SFT-Mistral-7B (MetaMath, OVM, Smart Ensemble) | Accuracy | 96.4 | — | Unverified |
| 5 | OpenMath2-Llama3.1-70B (majority@256) | Accuracy | 96 | — | Unverified |
| 6 | Jiutian Large Model | Accuracy | 95.2 | — | Unverified |
| 7 | DAMOMath-7B (MetaMath, OVM, BS, Ensemble) | Accuracy | 95.1 | — | Unverified |
| 8 | Claude 3 Opus (0-shot chain-of-thought) | Accuracy | 95 | — | Unverified |
| 9 | OpenMath2-Llama3.1-70B | Accuracy | 94.9 | — | Unverified |
| 10 | GPT-4 (Teaching-Inspired) | Accuracy | 94.8 | — | Unverified |
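In row 5, "majority@256" denotes majority voting (self-consistency): 256 solutions are sampled per problem and the most frequent final answer becomes the prediction. A minimal sketch of that vote in Python; the function name and example data are illustrative, not taken from the OpenMath2 codebase:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the modal final answer among k sampled solutions (majority@k)."""
    counts = Counter(a.strip() for a in answers)
    answer, _count = counts.most_common(1)[0]
    return answer

# Illustrative only: 256 sampled answers for one grade-school math problem.
samples = ["42"] * 180 + ["41"] * 50 + ["24"] * 26
assert len(samples) == 256
assert majority_vote(samples) == "42"
```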
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Text-davinci-002 (175B) (zero-shot CoT) | Accuracy | 78.7 | — | Unverified |
| 2 | Text-davinci-002 (175B) (zero-shot) | Accuracy | 17.7 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Tree of Thoughts (b=5) | Success | 0.74 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 (Teaching-Inspired) | Accuracy | 92.2 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 (Teaching-Inspired) | Accuracy | 89.2 | — | Unverified |
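Every Verified cell above is empty, so every entry carries Status "Unverified": no independent reproduction of the claimed number has been recorded yet. A minimal sketch of the comparison such a column implies, assuming a point-difference tolerance; the 0.5-point threshold and the "Disputed" label are assumptions, not the site's published rules:

```python
def verification_status(claimed: float, verified: float | None,
                        tol: float = 0.5) -> str:
    """Classify a leaderboard entry by comparing a claimed metric
    against an independent re-run (assumed rule, not SOTAVerified's)."""
    if verified is None:  # no reproduction recorded yet
        return "Unverified"
    if abs(claimed - verified) <= tol:
        return "Verified"
    return "Disputed"

print(verification_status(claimed=96.7, verified=None))  # Unverified
print(verification_status(claimed=96.7, verified=96.5))  # Verified
```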