A Unified Generalization Framework for Model Merging: Trade-offs, Non-Linearity, and Scaling Laws
Qinglun Li, Anke Tang, Miao Zhang, Mengzhu Wang, Quanjun Yin, Li Shen
Abstract
Model merging efficiently aggregates capabilities from multiple fine-tuned models into a single one, operating purely in parameter space without the original data or expensive re-computation. Despite empirical successes, a unified theory of its effectiveness under heterogeneous fine-tuning hyperparameters (e.g., varying learning rates, batch sizes) remains missing. Existing federated learning theories focus purely on optimization, which fails to explain model merging and inherently leads to theoretical paradoxes. To address this challenge, we pioneer the integration of L_2-Stability theory into heterogeneous environments to rigorously decouple the excess risk of the merged model x_avg into optimization and generalization errors. This analysis yields three main contributions: (i) We mathematically establish the fundamental Optimization-Generalization Trade-off, explicitly resolving the paradox of why over-trained experts lead to catastrophic merging collapse. (ii) We provide a unified theoretical framework that explains not only linear merging algorithms (e.g., TA, AdaMerging) but also state-of-the-art non-linear merging algorithms (e.g., TIES, DARE), proving how sparsification operators strictly tighten the generalization bound by suppressing task heterogeneity. (iii) Rather than heuristic guidelines, we derive Quantitative Scaling Laws that theoretically predict the precise impact of hyperparameter choices, enabling practitioners to strategically construct "merge-friendly" experts. Extensive experiments on ResNet and ViT architectures across 20 visual classification tasks, involving thousands of fine-tuned models, robustly confirm that our theoretical scaling laws accurately predict the empirical generalization behavior of x_avg.
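To make the two merging families mentioned above concrete, the sketch below shows linear merging via averaged task vectors (the x_avg construction) and a DARE-style drop-and-rescale sparsification applied before averaging. This is an illustrative sketch under simplifying assumptions (models as dicts of NumPy arrays, a single scaling coefficient `alpha`, a uniform drop rate), not the paper's implementation.

```python
import numpy as np

def task_vectors(base, experts):
    """Task vector per expert: fine-tuned weights minus the base weights."""
    return [{k: e[k] - base[k] for k in base} for e in experts]

def dare_sparsify(tv, drop_rate, rng):
    """DARE-style drop-and-rescale (assumed form): randomly zero a fraction
    of each task vector, then rescale survivors by 1/(1 - drop_rate) so the
    operator is unbiased in expectation."""
    out = {}
    for k, v in tv.items():
        mask = rng.random(v.shape) >= drop_rate
        out[k] = v * mask / (1.0 - drop_rate)
    return out

def merge(base, experts, alpha=1.0, drop_rate=0.0, seed=0):
    """Merged model x_avg = base + alpha * mean(task vectors).
    drop_rate=0 recovers plain linear (task-arithmetic) merging;
    drop_rate>0 adds the non-linear sparsification step."""
    rng = np.random.default_rng(seed)
    tvs = task_vectors(base, experts)
    if drop_rate > 0:
        tvs = [dare_sparsify(tv, drop_rate, rng) for tv in tvs]
    return {k: base[k] + alpha * np.mean([tv[k] for tv in tvs], axis=0)
            for k in base}
```

Intuitively, the sparsification step zeroes many conflicting coordinates across experts, which is the mechanism the abstract's bound-tightening result formalizes: suppressing task heterogeneity before averaging.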