high performance computing on graphics processing units: hgpu.org


cuSZp2: A GPU Lossy Compressor with Extreme Throughput and Optimized Compression Ratio

Yafan Huang, Sheng Di, Guanpeng Li, Franck Cappello

Computer Science Department, University of Iowa, Iowa City, IA, USA

International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2024

DOI:10.1109/SC41406.2024.00021

@inproceedings{huang2024cuszp2,
  title={cuSZp2: A GPU Lossy Compressor with Extreme Throughput and Optimized Compression Ratio},
  author={Huang, Yafan and Di, Sheng and Li, Guanpeng and Cappello, Franck},
  booktitle={SC24: International Conference for High Performance Computing, Networking, Storage and Analysis},
  pages={1--18},
  year={2024},
  organization={IEEE}
}


Source codes

Package:

cuSZp: Fast GPU error-bounded lossy compressor for floating-point data


Existing GPU lossy compressors suffer from expensive data movement overheads, inefficient memory access patterns, and high synchronization latency, all of which limit their throughput. This work proposes cuSZp2, a generic single-kernel, error-bounded lossy compressor that runs purely on GPUs and is designed for applications that require high speed, such as large-scale GPU simulation and large language model training. In particular, cuSZp2 introduces a novel lossless encoding method, optimizes memory access patterns, and hides synchronization latency, achieving extreme end-to-end throughput and an optimized compression ratio. Experiments on an NVIDIA A100 GPU with 9 real-world HPC datasets demonstrate that, even with higher compression ratios and data quality, cuSZp2 delivers on average 332.42 GB/s and 513.04 GB/s end-to-end throughput for compression and decompression, respectively, which is around 2x that of existing pure-GPU compressors and 200x that of CPU-GPU hybrid compressors.
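To make the term "error-bounded" in the abstract concrete, the sketch below shows one common way a lossy compressor can guarantee a user-set absolute error bound: quantize each value into integer bins of width twice the bound, then delta-encode the integers so a lossless stage sees small numbers. This is a generic CPU illustration in plain Python, not the cuSZp2 kernel or its encoding method; the function names and the choice of delta encoding are assumptions made for the example.

```python
def quantize(values, error_bound):
    """Map each float to the index of a bin of width 2 * error_bound."""
    return [round(v / (2.0 * error_bound)) for v in values]


def dequantize(codes, error_bound):
    """Reconstruct each value as the center of its bin."""
    return [c * 2.0 * error_bound for c in codes]


def delta_encode(codes):
    """Store each integer as its difference from the previous one (lossless)."""
    deltas, prev = [], 0
    for c in codes:
        deltas.append(c - prev)
        prev = c
    return deltas


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    codes, acc = [], 0
    for d in deltas:
        acc += d
        codes.append(acc)
    return codes


def max_abs_error(original, reconstructed):
    """Largest pointwise absolute difference between the two sequences."""
    return max(abs(a - b) for a, b in zip(original, reconstructed))


if __name__ == "__main__":
    data = [0.0, 0.113, 0.197, 0.325, -1.5, 3.14159]
    eb = 0.01  # hypothetical user-chosen absolute error bound
    deltas = delta_encode(quantize(data, eb))
    restored = dequantize(delta_decode(deltas), eb)
    print("max error:", max_abs_error(data, restored), "bound:", eb)
```

Only the quantization step loses information; the delta step is exactly invertible. Up to floating-point rounding, every reconstructed value lies within the chosen bound of its original.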

Tags: Compression, Computer science, CUDA, nVidia, nVidia A100, nVidia GeForce RTX 3080, nVidia GeForce RTX 3090, Package, PTX

February 16, 2025 by hgpu

