SOTAVerified

GSM8K

Papers

Showing 251-275 of 439 papers

| Title | Status | Hype |
| --- | --- | --- |
| Reliable Reasoning Beyond Natural Language | | 0 |
| Qwen2 Technical Report | Code | 13 |
| Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models | | 0 |
| Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | | 0 |
| Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On | | 0 |
| When is the consistent prediction likely to be a correct prediction? | | 0 |
| LoRA-GA: Low-Rank Adaptation with Gradient Approximation | Code | 3 |
| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Code | 0 |
| Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks | | 0 |
| DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning | Code | 1 |
| AgentInstruct: Toward Generative Teaching with Agentic Flows | | 0 |
| Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs | | 0 |
| Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning | Code | 1 |
| Advancing Process Verification for Large Language Models via Tree-Based Preference Learning | | 0 |
| LiteSearch: Efficacious Tree Search for LLM | | 0 |
| Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | Code | 3 |
| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0 |
| PORT: Preference Optimization on Reasoning Traces | | 0 |
| Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Code | 0 |
| LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback | Code | 1 |
| Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning | | 0 |
| Can LLMs Reason in the Wild with Programs? | Code | 0 |
| ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Code | 14 |
| Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Code | 1 |
| DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Accuracy | 98.1 | | Unverified |
| 2 | Orange-mini | 0-shot MRR | 98 | | Unverified |