high performance computing on graphics processing units: hgpu.org

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak

Carnegie Mellon University

arXiv:2406.01566 [cs.DC], (3 Jun 2024)

DOI:10.48550/arXiv.2406.01566

@misc{mei2024helix,
  title={Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs},
  author={Yixuan Mei and Yonghao Zhuang and Xupeng Miao and Juncheng Yang and Zhihao Jia and Rashmi Vinayak},
  year={2024},
  eprint={2406.01566},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem on a directed, weighted graph, whose nodes represent GPU instances and whose edges capture both GPU and network heterogeneity through their capacities. Helix then uses a mixed integer linear programming (MILP) algorithm to discover highly optimized strategies to serve LLMs. This approach allows Helix to jointly optimize model placement and request scheduling, two highly entangled tasks in heterogeneous LLM serving. Our evaluation on several heterogeneous cluster settings ranging from 24 to 42 GPU nodes shows that Helix improves serving throughput by up to 2.7× and reduces prompting and decoding latency by up to 2.8× and 1.3×, respectively, compared to the best existing approaches.
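To make the max-flow formulation concrete, here is a minimal sketch in Python. It is not Helix's implementation, and the MILP-based placement search is not shown. The cluster layout, GPU names, and every capacity (in tokens per second) are hypothetical values chosen for illustration. Each GPU is split into an "in" and an "out" vertex joined by an edge whose capacity stands for that GPU's compute throughput, which is a standard way to put a capacity on a node. The remaining edges stand for network links. The solver is a plain Edmonds-Karp max-flow, used here as a generic stand-in.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths until none remain."""
    # Residual capacities; every edge gets a zero-capacity reverse edge.
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap
            residual[v][u] += 0
    total = 0
    while True:
        # Breadth-first search for an augmenting path from source to sink.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Find the bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount of flow along the path.
        v = sink
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        total += bottleneck


UNBOUNDED = 10**9  # stands in for "no limit" on source and sink edges

# Hypothetical two-stage pipeline: stage one on an A100 and a T4,
# stage two on a V100 and a second T4. All numbers are made up.
cluster = {
    "source": {"a100_in": UNBOUNDED, "t4a_in": UNBOUNDED},
    "a100_in": {"a100_out": 900},   # A100 compute throughput
    "t4a_in": {"t4a_out": 200},     # T4 compute throughput
    "a100_out": {"v100_in": 400, "t4b_in": 300},  # network links
    "t4a_out": {"v100_in": 150},                  # network link
    "v100_in": {"v100_out": 500},   # V100 compute throughput
    "t4b_in": {"t4b_out": 200},     # second T4 compute throughput
    "v100_out": {"sink": UNBOUNDED},
    "t4b_out": {"sink": UNBOUNDED},
}

throughput = max_flow(cluster, "source", "sink")
print(throughput)
```

In this toy cluster the result is 700 tokens per second: the second stage is the bottleneck, since its two GPUs can process at most 500 + 200 between them regardless of how much the first stage and the network can deliver. Changing a link or GPU capacity and rerunning shows how the bottleneck moves between compute and network.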

Tags: Computer science, GPU cluster, Heterogeneous systems, nVidia, nVidia A100, nVidia V100, Tesla T4

June 9, 2024 by hgpu
