A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation
Edward Effendy, Kuan-Wei Tseng, Rei Kawakami
Code: github.com/grgward108/PosePretrain (PyTorch)
Abstract
Accepted at ICIP 2025.

We present a novel transformer-based framework for whole-body grasping that addresses both pose generation and motion infilling, enabling realistic and stable object interactions. Our pipeline comprises three stages: a Grasp Pose Generation module for full-body grasp synthesis, a Temporal Infilling module for smooth motion continuity, and a LiftUp Transformer that refines downsampled joints back to high-resolution markers. To overcome the scarcity of hand-object interaction data, we introduce a data-efficient generalized pretraining stage on large, diverse motion datasets, yielding robust spatio-temporal representations that transfer to grasping tasks. Experiments on the GRAB dataset show that our method outperforms state-of-the-art baselines in coherence, stability, and visual realism. The modular design also supports easy adaptation to other human-motion applications.
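To make the three-stage structure concrete, the sketch below composes the pipeline as plain Python functions. This is only an illustration of the modular flow described above, not the paper's implementation: the stage interfaces, joint count, marker count, and the linear infilling stand-in are all assumptions.

```python
# Hypothetical sketch of the three-stage pipeline: Grasp Pose Generation ->
# Temporal Infilling -> LiftUp. All shapes and interfaces are assumed, not
# taken from the paper; each stage is a placeholder for a transformer module.

NUM_JOINTS = 22     # assumed size of the downsampled joint set
NUM_MARKERS = 143   # assumed size of the high-resolution marker set

def grasp_pose_generation(object_features):
    """Stage 1: produce a full-body target grasp pose (placeholder)."""
    return [1.0] * NUM_JOINTS

def temporal_infilling(start_pose, end_pose, num_frames):
    """Stage 2: fill the frames between start and target pose.

    Linear interpolation stands in for the infilling transformer.
    """
    frames = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)
        frames.append([(1 - alpha) * a + alpha * b
                       for a, b in zip(start_pose, end_pose)])
    return frames

def liftup(frames):
    """Stage 3: map each low-resolution joint frame to markers (placeholder)."""
    return [[sum(f) / len(f)] * NUM_MARKERS for f in frames]

# Compose the stages end to end.
start = [0.0] * NUM_JOINTS
goal = grasp_pose_generation(object_features=None)
motion = temporal_infilling(start, goal, num_frames=30)
markers = liftup(motion)
```

In the actual system each stage would be a trained transformer sharing the pretrained spatio-temporal backbone; the modular composition shown here is what allows individual stages to be swapped for other human-motion tasks.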