Reasoning on a Budget: Miniaturizing DeepSeek R1 with SFT-GRPO Alignment for Instruction-Tuned LLMs
Esmaeil Narimissa
Code: github.com/EsmaeilNarimissa/aws-sft-grpo-budget-llm-finetune
Abstract
Large language models (LLMs) excel at general-purpose generation but often struggle with structured reasoning tasks. Recent methods like DeepSeek-R1 have shown that reinforcement learning with rule-based rewards can significantly enhance reasoning capabilities. However, reproducing such pipelines remains computationally intensive and inaccessible to most researchers. In this work, we present a modular, low-cost replication of the DeepSeek-R1 training methodology using Qwen2.5-0.5B-Instruct, a compact instruction-tuned LLM, optimized via a two-stage pipeline: Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). SFT aligns the base model to reasoning-structured prompts using LoRA-based parameter-efficient fine-tuning. GRPO then refines this policy using a critic-free reinforcement learning algorithm guided by five composable reward functions, including accuracy, reasoning presence, and formatting compliance. The entire training process was executed for under US$100 on AWS SageMaker, demonstrating that high-impact reasoning alignment is achievable without large-scale compute. Quantitative metrics confirm strong convergence, high reward stability, and consistent output structure. This study contributes a scalable and reproducible template for aligning compact LLMs to reasoning-intensive tasks under constrained computational budgets.
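To make the "composable reward functions" idea concrete, the sketch below shows three rule-based rewards of the kind the abstract names (accuracy, reasoning presence, formatting compliance) combined by simple summation. The function names, tag conventions (`<think>`/`<answer>`), and per-component weights are illustrative assumptions, not the paper's exact implementation; the actual five-reward design is in the linked repository.

```python
# Illustrative sketch of composable rule-based rewards for GRPO-style training.
# Tag format, weights, and function names are assumptions for demonstration.
import re

def accuracy_reward(completion: str, answer: str) -> float:
    # Reward an exact match between the extracted final answer and the reference.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == answer.strip() else 0.0

def reasoning_presence_reward(completion: str, answer: str = "") -> float:
    # Reward the presence of an explicit reasoning trace.
    return 0.5 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def format_reward(completion: str, answer: str = "") -> float:
    # Reward strict <think>...</think> then <answer>...</answer> structure.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 0.25 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def total_reward(completion: str, answer: str) -> float:
    # Composability: the scalar training reward is the sum of the components.
    return sum(fn(completion, answer) for fn in
               (accuracy_reward, reasoning_presence_reward, format_reward))
```

Because each component is an independent function of the sampled completion, rewards can be added, removed, or reweighted without touching the RL loop, which is what makes this style of rule-based reward design cheap to iterate on.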