
SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation

2026-03-13 · Code Available

Shuang Liang, Jing He, Chuanmeizhi Wang, Lejun Liao, Guo Zhang, Yingcong Chen, Yuan Yuan


Abstract

Pre-trained diffusion models provide rich latent features across U-Net levels and are emerging as powerful vision backbones. While prior works such as Marigold and Lotus repurpose diffusion priors for dense geometric perception tasks such as depth and surface normal estimation, their potential for cross-domain human pose estimation remains largely unexplored. Through a systematic analysis of latent features from different upsampling levels of the Stable Diffusion U-Net, we identify the levels that deliver the strongest robustness and cross-domain generalization for pose estimation. Building on these findings, we propose SDPose, which (i) extracts U-Net features from the selected upsampling blocks, (ii) fuses them with a lightweight feature aggregation module to form a robust representation, and (iii) jointly optimizes keypoint heatmap supervision with an auxiliary latent reconstruction loss to regularize training and preserve the pre-trained generative prior. To evaluate cross-domain generalization and robustness, we construct COCO-OOD, a COCO-based benchmark with four subsets: three style-transferred splits to assess domain shift, and one corruption split (noise, weather, digital artifacts, and blur) to test robustness. With a shorter fine-tuning schedule, SDPose achieves performance comparable to Sapiens on COCO, surpasses Sapiens-1B on COCO-WholeBody, and establishes new state-of-the-art results on HumanArt and COCO-OOD.
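The abstract outlines a three-step pipeline: extract features from selected U-Net upsampling blocks, fuse them with a lightweight aggregation module, and train with keypoint heatmap supervision plus an auxiliary latent reconstruction loss. The sketch below illustrates that training objective and a simple fusion step in NumPy; the fusion weights, the loss weighting factor `lam`, and the function names are all assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def fuse_features(feats, weights=None):
    """Aggregate multi-level U-Net features into one representation.

    feats: list of (C, H, W) arrays from the selected upsampling blocks,
    assumed already resized to a common resolution. A weighted sum stands
    in for the paper's lightweight aggregation module (assumption).
    """
    if weights is None:
        weights = np.full(len(feats), 1.0 / len(feats))
    return sum(w * f for w, f in zip(weights, feats))

def joint_loss(pred_heatmaps, gt_heatmaps, pred_latent, gt_latent, lam=0.1):
    """Heatmap supervision plus auxiliary latent reconstruction.

    The reconstruction term regularizes training toward the pre-trained
    generative prior; `lam` is a hypothetical weighting factor.
    """
    l_heatmap = np.mean((pred_heatmaps - gt_heatmaps) ** 2)  # keypoint MSE
    l_recon = np.mean((pred_latent - gt_latent) ** 2)        # latent MSE
    return l_heatmap + lam * l_recon
```

With identical predictions and targets both terms vanish, so the loss is zero; the fusion step reduces to an average when no weights are given.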
