SOTAVerified

GSM8K

Papers

Showing 351–375 of 439 papers

Title | Status | Hype
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0
PORT: Preference Optimization on Reasoning Traces |  | 0
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Code | 0
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning |  | 0
Can LLMs Reason in the Wild with Programs? | Code | 0
ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation | Code | 0
Uncertainty Aware Learning for Language Model Alignment |  | 0
Does your data spark joy? Performance gains from domain upsampling at the end of training |  | 0
Improve Mathematical Reasoning in Language Models by Automated Process Supervision |  | 0
GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment | Code | 0
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths |  | 0
Arithmetic Reasoning with LLM: Prolog Generation & Permutation |  | 0
Multi-Reference Preference Optimization for Large Language Models |  | 0
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time |  | 0
Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving |  | 0
Meaning-Typed Programming: Language Abstraction and Runtime for Model-Integrated Applications |  | 0
MathDivide: Improved mathematical reasoning by large language models |  | 0
MAmmoTH2: Scaling Instructions from the Web |  | 0
A Careful Examination of Large Language Model Performance on Grade School Arithmetic |  | 0
Iterative Reasoning Preference Optimization |  | 0
PARAMANU-GANITA: Language Model with Mathematical Capabilities |  | 0
Relevant or Random: Can LLMs Truly Perform Analogical Reasoning? |  | 0
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models |  | 0
Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning |  | 0
Automatic Prompt Selection for Large Language Models |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Accuracy | 98.1 |  | Unverified
2 | Orange-mini | 0-shot MRR | 98 |  | Unverified