PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment

2025-07-12

Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno

Code

github.com/ody-trek/posellm
Official · In paper · ★ 4

Abstract

Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches like LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM's linear projector fails to capture complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by 0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state of the art in language-guided pose estimation. Code is available at https://github.com/Ody-trek/PoseLLM.
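
To make the connector change in the abstract concrete, the following is a minimal sketch of a single linear projection next to a two-layer MLP with GELU, each applied independently to every visual patch token. It is not the authors' implementation. The toy dimensions (`VISION_DIM`, `LLM_DIM`, `NUM_PATCHES`), the random initialisation, the choice of hidden width equal to the output width, and all function names are assumptions made for illustration. The abstract specifies only "two-layer MLP with GELU activation".

```python
import math
import random

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(tokens, weight, bias):
    """Apply y = W x + b to each token; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]

def init_layer(in_dim, out_dim, rng):
    """Small random weights and zero bias (illustrative initialisation only)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def mlp_connector(tokens, layer1, layer2):
    """Two-layer MLP: linear -> GELU -> linear, applied per visual token."""
    hidden = linear(tokens, *layer1)
    hidden = [[gelu(h) for h in tok] for tok in hidden]
    return linear(hidden, *layer2)

rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4  # toy sizes, not the paper's

# Stand-in for patch features produced by a vision encoder.
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
           for _ in range(NUM_PATCHES)]

# Linear-projector baseline: one affine map into the LLM embedding space.
linear_out = linear(patches, *init_layer(VISION_DIM, LLM_DIM, rng))

# Nonlinear connector: hidden width assumed equal to LLM_DIM.
layer1 = init_layer(VISION_DIM, LLM_DIM, rng)
layer2 = init_layer(LLM_DIM, LLM_DIM, rng)
mlp_out = mlp_connector(patches, layer1, layer2)
```

Both connectors map each patch from the vision feature width to the LLM embedding width, so the MLP can be swapped in without changing the shape of the token sequence passed to the language model. The only structural difference is the extra layer and the GELU nonlinearity between the two affine maps.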

Tasks

Large Language Model, Pose Estimation, Zero-shot Generalization
