
Reasoning on a Budget: Miniaturizing DeepSeek R1 with SFT-GRPO Alignment for Instruction-Tuned LLMs

2025-05-16 · TechRxiv 2025 · Code Available

Esmaeil Narimissa


Abstract

Large language models (LLMs) excel at general-purpose generation but often struggle with structured reasoning tasks. Recent work such as DeepSeek-R1 has shown that reinforcement learning with rule-based rewards can significantly enhance reasoning capabilities. However, reproducing such pipelines remains computationally intensive and inaccessible to most researchers. In this work, we present a modular, low-cost replication of the DeepSeek-R1 training methodology using Qwen2.5-0.5B-Instruct (a compact instruction-tuned LLM), optimized via a two-stage pipeline: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO). SFT aligns the base model to reasoning-structured prompts using LoRA-based parameter-efficient fine-tuning. GRPO then refines this policy with a critic-free reinforcement learning algorithm guided by five composable reward functions, including accuracy, reasoning presence, and formatting compliance. The entire training process was executed for under US$100 on AWS SageMaker, demonstrating that high-impact reasoning alignment is achievable without large-scale compute. Quantitative metrics confirm strong convergence, high reward stability, and consistent output structure. This study contributes a scalable and reproducible template for aligning compact LLMs to reasoning-intensive tasks under constrained computational budgets.
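The two ingredients the abstract names, composable rule-based rewards and GRPO's critic-free group-relative advantages, can be illustrated in plain Python. This is a hypothetical sketch, not the paper's code: the function names, the `<think>`/`<answer>` tag format, and the reward weights are assumptions, and only three of the paper's five rewards are shown.

```python
# Hedged sketch of composable rule-based rewards and GRPO-style
# group-relative advantages. Names and tag format are illustrative
# assumptions, not the paper's actual implementation.
import re
import statistics

def accuracy_reward(completion: str, answer: str) -> float:
    """1.0 if the extracted <answer> matches the reference answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == answer.strip() else 0.0

def reasoning_presence_reward(completion: str) -> float:
    """Reward completions that include an explicit reasoning trace."""
    return 0.5 if "<think>" in completion and "</think>" in completion else 0.0

def format_reward(completion: str) -> float:
    """Reward strict <think>...</think><answer>...</answer> structure."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 0.5 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def total_reward(completion: str, answer: str) -> float:
    """Sum the composable rewards (the paper uses five; three sketched here)."""
    return (accuracy_reward(completion, answer)
            + reasoning_presence_reward(completion)
            + format_reward(completion))

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Critic-free GRPO advantage: standardize rewards within one
    sampled group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: score a group of two sampled completions for one prompt.
group = [
    "<think>2 + 2 = 4</think><answer>4</answer>",  # correct, well-formatted
    "<answer>5</answer>",                          # wrong, no reasoning trace
]
rewards = [total_reward(c, "4") for c in group]      # → [2.0, 0.0]
advantages = group_relative_advantages(rewards)      # → [1.0, -1.0]
```

Because advantages are computed relative to the group mean rather than a learned value function, no separate critic network is trained, which is what keeps GRPO cheap enough for a sub-$100 budget.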
