USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
Jiarui Fang, Shangchun Zhao
Abstract
Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, namely DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach that is more robust to transformer model architectures and network hardware topologies. We compare the communication and memory costs of SP with those of existing parallelism methods, including data, tensor, ZeRO, and pipeline parallelism, and discuss best practices for designing hybrid 4D parallelism involving SP. Using SP, we achieved 47% MFU when training the LLAMA3-8B model at a sequence length of 208K on two 8xA800 nodes. Our code is publicly available at https://github.com/feifeibear/long-context-attention.
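To make the sequence-split idea concrete, below is a minimal single-process sketch of the layout change behind a Ulysses-style all-to-all: each of P simulated ranks starts with a sequence shard of shape [seq/P, heads, d], and the reshuffle leaves every rank with the full sequence for heads/P of the attention heads, so an ordinary attention kernel can run locally. All names, shapes, and the loop structure are our own illustration (plain tensor ops standing in for the collective), not the repository's API.

```python
import torch

# Simulate P sequence-parallel ranks on one process with plain tensor ops.
P = 4                        # size of the (hypothetical) Ulysses process group
seq, heads, d = 16, 8, 32    # global sequence length, attention heads, head dim
assert seq % P == 0 and heads % P == 0

x = torch.randn(seq, heads, d)        # a QKV-like activation

# Sequence parallelism: each rank holds a contiguous shard of the sequence.
seq_shards = list(x.chunk(P, dim=0))  # P tensors of shape [seq/P, heads, d]

# Ulysses-style all-to-all: trade the sequence split for a head split, so
# each rank ends up with the FULL sequence for heads/P heads.
head_shards = []
for r in range(P):
    # rank r gathers its head slice from every peer's sequence shard
    pieces = [shard.chunk(P, dim=1)[r] for shard in seq_shards]
    head_shards.append(torch.cat(pieces, dim=0))  # [seq, heads/P, d]

# Sanity check: both layouts carry exactly the same data.
assert torch.equal(torch.cat(head_shards, dim=1), x)
```

In a real deployment this reshuffle is an all-to-all collective, whereas Ring-Attention keeps the sequence split and circulates key/value blocks around a ring of devices via P2P communication; the unified approach composes the two over a 2D device mesh, placing the all-to-all on the better-connected dimension and the ring on the other.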