Context Parallelism for Scalable Million-Token Inference

Amy (Jie) Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jeremy Reizenstein, Jongsoo Park, Jianyu Huang
Meta Platforms, Inc.
arXiv:2411.01783 [cs.DC] (10 Nov 2024)

@misc{yang2024contextparallelismscalablemilliontoken,
   title={Context Parallelism for Scalable Million-Token Inference},
   author={Amy Yang and Jingyi Yang and Aya Ibrahim and Xinfeng Xie and Bangsheng Tang and Grigory Sizov and Jeremy Reizenstein and Jongsoo Park and Jianyu Huang},
   year={2024},
   eprint={2411.01783},
   archivePrefix={arXiv},
   primaryClass={cs.DC},
   url={https://arxiv.org/abs/2411.01783}
}

We present context parallelism for long-context large language model inference, which achieves near-linear scaling of long-context prefill latency with up to 128 H100 GPUs across 16 nodes. In particular, our method achieves 1M context prefill with the Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K context prefill in 3.8s. We develop two lossless, exact ring attention variants, pass-KV and pass-Q, which cover a wide range of use cases with state-of-the-art performance: full prefill, persistent KV prefill, and decode. Benchmarks on H100 GPU hosts interconnected with RDMA and with TCP show similar scalability for long-context prefill, demonstrating that our method scales well in common commercial data centers with medium-to-low inter-host bandwidth.
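To make the "lossless exact" claim concrete, the following is a minimal single-process sketch of the general ring attention idea in a pass-KV style. It is an illustration under stated assumptions, not the authors' implementation: it assumes that pass-KV means key/value blocks circulate around the ring while each rank keeps its own queries, it uses a single head with no causal mask, and it simulates communication by rotating a Python list. All function names (`attend`, `merge`, `ring_pass_kv`, `full_attention`) are hypothetical. The point is only that partial results merged with log-sum-exp rescaling reproduce ordinary attention up to floating-point rounding.

```python
import math
import random


def attend(q, keys, values):
    """Attention of one query over one KV block.

    Returns (unnormalized output, max score, sum of exp weights) so that
    results from different blocks can be merged later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, m, sum(weights)


def merge(a, b):
    """Combine two partial results by rescaling both to a shared max score."""
    out_a, m_a, l_a = a
    out_b, m_b, l_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    out = [s_a * x + s_b * y for x, y in zip(out_a, out_b)]
    return out, m, s_a * l_a + s_b * l_b


def ring_pass_kv(q_shards, k_shards, v_shards):
    """Simulate n ranks; rank r keeps q_shards[r] while KV blocks rotate."""
    n = len(q_shards)
    partial = [[None] * len(shard) for shard in q_shards]
    kv = list(zip(k_shards, v_shards))  # kv[r] is the block rank r holds now
    for _ in range(n):
        for r in range(n):
            keys, values = kv[r]
            for i, q in enumerate(q_shards[r]):
                block = attend(q, keys, values)
                prev = partial[r][i]
                partial[r][i] = block if prev is None else merge(prev, block)
        # Stand-in for the ring send/receive: rank r gets the block of rank r-1.
        kv = kv[-1:] + kv[:-1]
    # Final normalization by the accumulated softmax denominator.
    return [[[x / l for x in out] for out, _, l in rank] for rank in partial]


def full_attention(q, keys, values):
    """Reference: ordinary softmax attention over the whole sequence."""
    out, _, l = attend(q, keys, values)
    return [x / l for x in out]


def random_shards(n_ranks, tokens_per_rank, dim, rng):
    """Hypothetical helper that builds random per-rank token shards."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(tokens_per_rank)] for _ in range(n_ranks)]


if __name__ == "__main__":
    rng = random.Random(0)
    qs, ks, vs = (random_shards(4, 3, 8, rng) for _ in range(3))
    ring = ring_pass_kv(qs, ks, vs)
    all_k = [k for shard in ks for k in shard]
    all_v = [v for shard in vs for v in shard]
    worst = max(abs(a - b)
                for r in range(4) for i, q in enumerate(qs[r])
                for a, b in zip(ring[r][i], full_attention(q, all_k, all_v)))
    print("ring attention matches full attention:", worst < 1e-9)
```

In a real deployment each rank would be a separate GPU and the list rotation would be a point-to-point transfer overlapped with the attention computation; a pass-Q style variant would presumably rotate the query shards instead and leave the KV blocks in place.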