USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
Jiarui Fang, Shangchun Zhao
Abstract
Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, namely DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach that is more robust to transformer model architectures and network hardware topologies. We compare the communication and memory costs of SP with those of existing parallelism methods, including data, tensor, ZeRO, and pipeline parallelism, and discuss best practices for designing hybrid 4D parallelism involving SP. Using SP, we achieved 47% MFU when training the LLAMA3-8B model at a sequence length of 208K on two 8xA800 nodes. Our code is publicly available at https://github.com/feifeibear/long-context-attention.
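To make the sequence-split idea concrete, below is a minimal single-process sketch of the layout change behind a Ulysses-style all-to-all: each of P simulated ranks starts with a sequence shard of shape [seq/P, heads, d], and the reshuffle leaves every rank with the full sequence for heads/P of the attention heads, so an ordinary attention kernel can run locally. All names, shapes, and the loop structure are our own illustration (plain tensor ops standing in for the collective), not the repository's API.

```python
import torch

# Simulate P sequence-parallel ranks on one process with plain tensor ops.
P = 4                        # size of the (hypothetical) Ulysses process group
seq, heads, d = 16, 8, 32    # global sequence length, attention heads, head dim
assert seq % P == 0 and heads % P == 0

x = torch.randn(seq, heads, d)        # a QKV-like activation

# Sequence parallelism: each rank holds a contiguous shard of the sequence.
seq_shards = list(x.chunk(P, dim=0))  # P tensors of shape [seq/P, heads, d]

# Ulysses-style all-to-all: trade the sequence split for a head split, so
# each rank ends up with the FULL sequence for heads/P heads.
head_shards = []
for r in range(P):
    # rank r gathers its head slice from every peer's sequence shard
    pieces = [shard.chunk(P, dim=1)[r] for shard in seq_shards]
    head_shards.append(torch.cat(pieces, dim=0))  # [seq, heads/P, d]

# Sanity check: both layouts carry exactly the same data.
assert torch.equal(torch.cat(head_shards, dim=1), x)
```

In a real deployment this reshuffle is an all-to-all collective, whereas Ring-Attention keeps the sequence split and circulates key/value blocks around a ring of devices via P2P communication; the unified approach composes the two over a 2D device mesh, placing the all-to-all on the better-connected dimension and the ring on the other.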