SOTAVerified

GSM8K

Papers

Showing 401–439 of 439 papers

Title | Status | Hype
MathScale: Scaling Instruction Tuning for Mathematical Reasoning | Code | 0
Activation Steering for Chain-of-Thought Compression | Code | 0
Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges | Code | 0
Text-to-LoRA: Instant Transformer Adaption | Code | 0
Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts? | Code | 0
metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Code | 0
DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression | Code | 0
Adaptive Rectification Sampling for Test-Time Compute Scaling | Code | 0
LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning | Code | 0
The Price of Format: Diversity Collapse in LLMs | Code | 0
ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation | Code | 0
EchoPrompt: Instructing the Model to Rephrase Queries for Improved In-context Learning | Code | 0
LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity | Code | 0
LLM2: Let Large Language Models Harness System 2 Reasoning | Code | 0
COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement | Code | 0
Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting | Code | 0
Learning a Continue-Thinking Token for Enhanced Test-Time Scaling | Code | 0
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Code | 0
SMART: Self-learning Meta-strategy Agent for Reasoning Tasks | Code | 0
Can LLMs Reason in the Wild with Programs? | Code | 0
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0
Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving | Code | 0
Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective | Code | 0
In-Context Principle Learning from Mistakes | Code | 0
AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations | Code | 0
TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation | Code | 0
TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students | Code | 0
How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective | Code | 0
GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment | Code | 0
Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls | Code | 0
Fill in the Blank: Exploring and Enhancing LLM Capabilities for Backward Reasoning in Math Word Problems | Code | 0
DIVE: Diversified Iterative Self-Improvement | Code | 0
ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving | Code | 0
Exploring LLM Reasoning Through Controlled Prompt Variations | Code | 0
Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning | Code | 0
Distilling Reasoning Capabilities into Smaller Language Models | Code | 0
AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting | Code | 0
Discriminative Policy Optimization for Token-Level Reward Models | Code | 0
DiscQuant: A Quantization Method for Neural Networks Inspired by Discrepancy Theory | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Xolver | Accuracy | 98.1 | | Unverified
2 | Orange-mini | 0-shot MRR | 98 | | Unverified