Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters
Georgia Institute of Technology, Atlanta, GA, United States
31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP’26), 2026
@inproceedings{han2026scaling,
  title     = {Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters},
  author    = {Han, Ruobing and Kim, Hyesoon},
  booktitle = {Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming},
  pages     = {355--368},
  year      = {2026}
}
The growing demand for GPU resources has led to widespread shortages in data centers, prompting the exploration of CPUs as an alternative for executing GPU programs. While prior research supports executing GPU programs on single CPUs, these approaches struggle to achieve competitive performance due to the computational capacity gap between GPUs and CPUs. To further improve performance, we introduce CuCC, a framework that scales GPU-to-CPU migration to CPU clusters and utilizes distributed CPU nodes to execute GPU programs. Compared to single-CPU execution, CPU cluster execution requires cross-node communication to maintain data consistency. We present the CuCC execution workflow and communication optimizations, which aim to reduce network overhead. Evaluations demonstrate that CuCC achieves high scalability on large-scale CPU clusters and delivers runtimes approaching those of GPUs. In terms of cluster-wide throughput, CuCC enables CPUs to achieve an average of 2.59x higher throughput than GPUs.
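The core challenge the abstract names is that once a GPU kernel's thread blocks are spread across CPU nodes, the nodes must communicate to preserve the shared-global-memory view the kernel was written against. The sketch below is a minimal hand-written MPI illustration of that idea only, not CuCC's actual workflow: the SAXPY kernel, the contiguous block partitioning, and the MPI_Allgatherv exchange are all assumptions made for illustration.

// Illustrative sketch only (not CuCC): distribute a GPU-style grid of thread
// blocks across CPU nodes with MPI, then restore a consistent view of the
// output buffer via an all-gather.
#include <mpi.h>
#include <algorithm>
#include <cstdio>
#include <vector>

// CUDA-style SAXPY "kernel body", executed for one thread block on a CPU core.
static void saxpy_block(int block, int block_dim, int n, float a,
                        const std::vector<float>& x, std::vector<float>& y) {
    for (int t = 0; t < block_dim; ++t) {      // emulate the threads of the block
        int i = block * block_dim + t;         // global thread index
        if (i < n) y[i] = a * x[i] + y[i];
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1 << 20, block_dim = 256;
    const int num_blocks = (n + block_dim - 1) / block_dim;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // Static partition of the grid: each rank owns a contiguous range of blocks,
    // which maps to a contiguous slice of y.
    const int blocks_per_rank = (num_blocks + nprocs - 1) / nprocs;
    const int first = rank * blocks_per_rank;
    const int last  = std::min(num_blocks, first + blocks_per_rank);
    for (int b = first; b < last; ++b)
        saxpy_block(b, block_dim, n, 3.0f, x, y);

    // Cross-node communication for data consistency: every rank gathers the
    // slices computed by the other ranks, so all nodes again see the same y.
    std::vector<int> counts(nprocs), displs(nprocs);
    for (int r = 0; r < nprocs; ++r) {
        int lo = std::min(n, r * blocks_per_rank * block_dim);
        int hi = std::min(n, (r + 1) * blocks_per_rank * block_dim);
        counts[r] = hi - lo;
        displs[r] = lo;
    }
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   y.data(), counts.data(), displs.data(), MPI_FLOAT,
                   MPI_COMM_WORLD);

    if (rank == 0) std::printf("y[0] = %f\n", y[0]);  // expect 5.0
    MPI_Finalize();
    return 0;
}

The all-gather after the kernel stands in for the consistency traffic the abstract refers to; it is exactly this kind of network overhead that the paper's communication optimizations aim to reduce.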
February 8, 2026 by hgpu