SOTAVerified

Arithmetic Reasoning

Papers

Showing 101–125 of 175 papers

| Title | Status | Hype |
| --- | --- | --- |
| The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve? | | 0 |
| Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights | | 0 |
| On Representational Dissociation of Language and Arithmetic in Large Language Models | | 0 |
| Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding | | 0 |
| Can LLMs Maintain Fundamental Abilities under KV Cache Compression? | | 0 |
| CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization | | 0 |
| SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training | | 0 |
| DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models | | 0 |
| Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning | | 0 |
| Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs | | 0 |
| Hint Marginalization for Improved Reasoning in Large Language Models | | 0 |
| GaLore+: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection | | 0 |
| S^2FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity | | 0 |
| Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning | | 0 |
| PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model | | 0 |
| Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning | Code | 0 |
| Think Beyond Size: Adaptive Prompting for More Effective Reasoning | | 0 |
| Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | Code | 0 |
| Unlocking Structured Thinking in Language Models with Cognitive Prompting | | 0 |
| Small Language Models are Equation Reasoners | | 0 |
| 3-in-1: 2D Rotary Adaptation for Efficient Finetuning, Efficient Batching and Composability | Code | 0 |
| Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning Tasks | | 0 |
| Leveraging LLM Reasoning Enhances Personalized Recommender Systems | | 0 |
| Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together | | 0 |
| Self-training Language Models for Arithmetic Reasoning | Code | 0 |
Page 5 of 7

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude 3.5 Sonnet (HPT) | Accuracy | 97.72 | — | Unverified |
| 2 | DUP prompt upon GPT-4 | Accuracy | 97.1 | — | Unverified |
| 3 | Qwen2-Math-72B-Instruct (greedy) | Accuracy | 96.7 | — | Unverified |
| 4 | SFT-Mistral-7B (Metamath, OVM, Smart Ensemble) | Accuracy | 96.4 | — | Unverified |
| 5 | OpenMath2-Llama3.1-70B (majority@256) | Accuracy | 96 | — | Unverified |
| 6 | Jiutian-大模型 | Accuracy | 95.2 | — | Unverified |
| 7 | DAMOMath-7B (MetaMath, OVM, BS, Ensemble) | Accuracy | 95.1 | — | Unverified |
| 8 | Claude 3 Opus (0-shot chain-of-thought) | Accuracy | 95 | — | Unverified |
| 9 | OpenMath2-Llama3.1-70B | Accuracy | 94.9 | — | Unverified |
| 10 | GPT-4 (Teaching-Inspired) | Accuracy | 94.8 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Text-davinci-002 (175B) (zero-shot-cot) | Accuracy | 78.7 | — | Unverified |
| 2 | Text-davinci-002 (175B) (zero-shot) | Accuracy | 17.7 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Tree of Thoughts (b=5) | Success | 0.74 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 (Teaching-Inspired) | Accuracy | 92.2 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 (Teaching-Inspired) | Accuracy | 89.2 | — | Unverified |