GSM8K

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 251–300 of 439 papers

Title	Date	Tasks	Status	Hype
Reliable Reasoning Beyond Natural Language	Jul 16, 2024	GSM8KMathematical Reasoning	—Unverified	0
Qwen2 Technical Report	Jul 15, 2024	Arithmetic ReasoningGSM8K	CodeCode Available	13
Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models	Jul 12, 2024	GSM8KMath	—Unverified	0
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist	Jul 11, 2024	GSM8KMath	—Unverified	0
Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On	Jul 11, 2024	GSM8KMath	—Unverified	0
When is the consistent prediction likely to be a correct prediction?	Jul 8, 2024	GSM8KPrediction	—Unverified	0
LoRA-GA: Low-Rank Adaptation with Gradient Approximation	Jul 6, 2024	GSM8Kparameter-efficient fine-tuning	CodeCode Available	3
metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models	Jul 4, 2024	ARCGSM8K	CodeCode Available	0
Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks	Jul 4, 2024	GSM8KStrategyQA	—Unverified	0
DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning	Jul 4, 2024	AvgGSM8K	CodeCode Available	1
AgentInstruct: Toward Generative Teaching with Agentic Flows	Jul 3, 2024	GSM8KMMLU	—Unverified	0
Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs	Jul 1, 2024	DiversityGSM8K	—Unverified	0
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning	Jun 30, 2024	GSM8KMath	CodeCode Available	1
Advancing Process Verification for Large Language Models via Tree-Based Preference Learning	Jun 29, 2024	Binary ClassificationGSM8K	—Unverified	0
LiteSearch: Efficacious Tree Search for LLM	Jun 29, 2024	GSM8KMathematical Reasoning	—Unverified	0
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs	Jun 26, 2024	Arithmetic ReasoningGSM8K	CodeCode Available	3
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation	Jun 25, 2024	ARCBenchmarking	CodeCode Available	0
PORT: Preference Optimization on Reasoning Traces	Jun 23, 2024	ARCGSM8K	—Unverified	0
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation	Jun 20, 2024	GSM8KLanguage Model Evaluation	CodeCode Available	0
LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback	Jun 20, 2024	Binary ClassificationGSM8K	CodeCode Available	1
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning	Jun 20, 2024	GSM8KHeuristic Search	—Unverified	0
Can LLMs Reason in the Wild with Programs?	Jun 19, 2024	GSM8KMath	CodeCode Available	0
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools	Jun 18, 2024	AllGSM8K	CodeCode Available	14
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles	Jun 18, 2024	Arithmetic ReasoningCode Generation	CodeCode Available	1
DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling	Jun 17, 2024	GSM8KMath	CodeCode Available	1
ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation	Jun 16, 2024	Continual LearningGSM8K	CodeCode Available	0
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B	Jun 11, 2024	Decision MakingGSM8K	CodeCode Available	5
Uncertainty Aware Learning for Language Model Alignment	Jun 7, 2024	GSM8KLanguage Modeling	—Unverified	0
Improve Mathematical Reasoning in Language Models by Automated Process Supervision	Jun 5, 2024	GSM8KMath	—Unverified	0
Does your data spark joy? Performance gains from domain upsampling at the end of training	Jun 5, 2024	GSM8KHumanEval	—Unverified	0
Automatic Instruction Evolving for Large Language Models	Jun 2, 2024	GSM8KHumanEval	CodeCode Available	3
GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment	May 30, 2024	GSM8KKnowledge Distillation	CodeCode Available	0
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths	May 30, 2024	GSM8KHumanEval	—Unverified	0
Arithmetic Reasoning with LLM: Prolog Generation & Permutation	May 28, 2024	Arithmetic ReasoningData Augmentation	—Unverified	0
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters	May 27, 2024	BenchmarkingGSM8K	CodeCode Available	2
Multi-Reference Preference Optimization for Large Language Models	May 26, 2024	GSM8KTruthfulQA	—Unverified	0
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time	May 25, 2024	GSM8KMath	—Unverified	0
Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training	May 23, 2024	GSM8KMixture-of-Experts	CodeCode Available	7
ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification	May 23, 2024	GPUGSM8K	CodeCode Available	1
Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast	May 23, 2024	Computational EfficiencyGSM8K	CodeCode Available	1
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step	May 23, 2024	GSM8K	CodeCode Available	3
Multiple-Choice Questions are Efficient and Robust LLM Evaluators	May 20, 2024	GSM8KHumanEval	CodeCode Available	1
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark	May 20, 2024	College MathematicsGSM8K	CodeCode Available	2
Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving	May 20, 2024	GSM8KMath	—Unverified	0
Meaning-Typed Programming: Language Abstraction and Runtime for Model-Integrated Applications	May 14, 2024	GSM8KMath	—Unverified	0
MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning	May 13, 2024	Data AugmentationGSM8K	CodeCode Available	3
MathDivide: Improved mathematical reasoning by large language models	May 12, 2024	GSM8KLogical Reasoning	—Unverified	0
MAmmoTH2: Scaling Instructions from the Web	May 6, 2024	ChatbotGSM8K	—Unverified	0
Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning	May 5, 2024	GSM8KMath	CodeCode Available	2
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning	May 1, 2024	ARCGSM8K	CodeCode Available	3

Show:10 25 50

← PrevPage 6 of 9Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Xolver	Accuracy	98.1	—	Unverified
2	Orange-mini	0-shot MRR	98	—	Unverified