SOTAVerified

Arithmetic Reasoning

Papers

Showing 101-150 of 175 papers

Title | Status | Hype
ChatGPT as a Math Questioner? Evaluating ChatGPT on Generating Pre-university Math Questions | Code | 0
Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning? | Code | 0
Improving Arithmetic Reasoning Ability of Large Language Models through Relation Tuples, Verification and Dynamic Feedback | Code | 0
Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning | Code | 0
Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | Code | 0
Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning | Code | 0
LLM Augmented LLMs: Expanding Capabilities through Composition | Code | 0
Self-training Language Models for Arithmetic Reasoning | Code | 0
Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training | | 0
Leveraging LLM Reasoning Enhances Personalized Recommender Systems | | 0
Arithmetic Reasoning with LLM: Prolog Generation & Permutation | | 0
Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering | | 0
Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning | | 0
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | | 0
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models | | 0
Can LLMs Maintain Fundamental Abilities under KV Cache Compression? | | 0
CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization | | 0
Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models | | 0
Composing Ensembles of Pre-trained Models via Iterative Consensus | | 0
DiversiGATE: A Comprehensive Framework for Reliable Large Language Models | | 0
DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models | | 0
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment | | 0
Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting | | 0
Exploring Group and Symmetry Principles in Large Language Models | | 0
Fact-Consistency Evaluation of Text-to-SQL Generation for Business Intelligence Using Exaone 3.5 | | 0
Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together | | 0
FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design | | 0
GaLore+: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection | | 0
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes | | 0
Hint Marginalization for Improved Reasoning in Large Language Models | | 0
Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights | | 0
Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning | | 0
KwaiYiiMath: Technical Report | | 0
Large Language Models are Null-Shot Learners | | 0
Large Language Models Can Self-Correct with Key Condition Verification | | 0
Large Language Models Can Self-Improve | | 0
Learning-at-Criticality in Large Language Models for Quantum Field Theory and Beyond | | 0
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models | | 0
Model Card and Evaluations for Claude Models | | 0
Neural-Symbolic Recursive Machine for Systematic Generalization | | 0
NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks | | 0
On Representational Dissociation of Language and Arithmetic in Large Language Models | | 0
Making Large Language Models Better Reasoners with Step-Aware Verifier | | 0
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data | | 0
Orca 2: Teaching Small Language Models How to Reason | | 0
Orca-Math: Unlocking the potential of SLMs in Grade School Math | | 0
PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model | | 0
Prompt Sketching for Large Language Models | | 0
RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought | | 0
Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning Tasks | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Claude 3.5 Sonnet (HPT) | Accuracy | 97.72 | | Unverified
2 | DUP prompt upon GPT-4 | Accuracy | 97.1 | | Unverified
3 | Qwen2-Math-72B-Instruct (greedy) | Accuracy | 96.7 | | Unverified
4 | SFT-Mistral-7B (Metamath, OVM, Smart Ensemble) | Accuracy | 96.4 | | Unverified
5 | OpenMath2-Llama3.1-70B (majority@256) | Accuracy | 96 | | Unverified
6 | Jiutian-大模型 | Accuracy | 95.2 | | Unverified
7 | DAMOMath-7B (MetaMath, OVM, BS, Ensemble) | Accuracy | 95.1 | | Unverified
8 | Claude 3 Opus (0-shot chain-of-thought) | Accuracy | 95 | | Unverified
9 | OpenMath2-Llama3.1-70B | Accuracy | 94.9 | | Unverified
10 | GPT-4 (Teaching-Inspired) | Accuracy | 94.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Text-davinci-002 (175B) (zero-shot-cot) | Accuracy | 78.7 | | Unverified
2 | Text-davinci-002 (175B) (zero-shot) | Accuracy | 17.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Tree of Thoughts (b=5) | Success | 0.74 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 (Teaching-Inspired) | Accuracy | 92.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 (Teaching-Inspired) | Accuracy | 89.2 | | Unverified