high performance computing on graphics processing units: hgpu.org

Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications

Istvan Z Reguly

Pázmány Péter Catholic University, Faculty of Information Technology and Bionics, Budapest, Hungary

arXiv:2309.10075 [cs.PF], (18 Sep 2023)

DOI:10.48550/arXiv.2309.10075

@misc{reguly2023evaluating,
  title={Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications},
  author={Istvan Z Reguly},
  year={2023},
  eprint={2309.10075},
  archivePrefix={arXiv},
  primaryClass={cs.PF}
}

In this paper, we evaluate the portability of the SYCL programming model on some of the latest CPUs and GPUs from a wide range of vendors, utilizing the two main compilers: DPC++ and hipSYCL/OpenSYCL. Both compilers currently support GPUs from all three major vendors; we evaluate performance on the Intel(R) Data Center GPU Max 1100, the NVIDIA A100 GPU, and the AMD MI250X GPU. Support on CPUs is currently less established: DPC++ supports only x86 CPUs, through OpenCL, whereas OpenSYCL has an OpenMP backend capable of targeting all modern CPUs. We benchmark the Intel Xeon Platinum 8360Y Processor (Ice Lake), the AMD EPYC 9V33X (Genoa-X), and the Ampere Altra platforms. We study a range of primarily bandwidth-bound applications implemented using the OPS and OP2 DSLs, evaluate different formulations in SYCL, and contrast their performance with "native" programming approaches where available (CUDA/HIP/OpenMP). On GPU architectures, SYCL on average even slightly outperforms native approaches, while on CPUs it falls behind, highlighting a continued need for improving CPU performance. While SYCL does not solve all the challenges of performance portability (e.g. needing different algorithms on different hardware), it does provide a single programming model and ecosystem to target most current HPC architectures productively.
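To make "bandwidth-bound" concrete, the following is a minimal sketch in plain C++ of a STREAM-style triad and the effective-bandwidth figure usually reported for such kernels. It is not taken from the paper and does not use OPS, OP2, or SYCL; the function names `triad`, `triad_bytes`, and `effective_bandwidth_gbs` are illustrative assumptions. The point is only that each element needs three memory accesses but two floating-point operations, so achieved GB/s is the natural performance measure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// STREAM-style triad: a[i] = b[i] + scalar * c[i].
// Per element: two loads and one store, but only two floating-point
// operations, so run time is dominated by memory traffic.
void triad(std::vector<double>& a, const std::vector<double>& b,
           const std::vector<double>& c, double scalar) {
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] + scalar * c[i];
    }
}

// Bytes the triad has to move for n elements: read b, read c, write a.
// Assumption: write-allocate traffic on the store is ignored.
std::size_t triad_bytes(std::size_t n) {
    return 3 * n * sizeof(double);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes), given the bytes moved
// and the measured wall-clock time in seconds.
double effective_bandwidth_gbs(std::size_t bytes, double seconds) {
    return static_cast<double>(bytes) / seconds / 1e9;
}
```

In use, one would time a call to `triad` on arrays much larger than the last-level cache and pass `triad_bytes(n)` and the elapsed seconds to `effective_bandwidth_gbs`; comparing that figure against the platform's peak memory bandwidth indicates how close a given implementation gets to the hardware limit.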

Tags: AMD Radeon Instinct MI250X, ATI, Benchmarking, cfd, Computer science, CUDA, Fluid dynamics, HIP, Intel, Intel Ponte Vecchio Max 1100, nVidia, nVidia A100, OpenCL, performance portability, SYCL

September 24, 2023 by hgpu
