high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Whole-function vectorization

Whole-function vectorization

Ralf Karrenberg, Sebastian Hack

Saarland University

9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2011

DOI:10.1109/CGO.2011.5764682

BibTeX

Download (PDF)

View

Source

1655

views

Data-parallel programming languages are an important component in today’s parallel computing landscape. Among those are domain-specific languages like shading languages in graphics (HLSL, GLSL, RenderMan, etc.) and “general-purpose” languages like CUDA or OpenCL. Current implementations of those languages on CPUs solely rely on multi-threading to implement parallelism and ignore the additional intra-core parallelism provided by the SIMD instruction set of those processors (like Intel’s SSE and the upcoming AVX or Larrabee instruction sets). In this paper, we discuss several aspects of implementing dataparallel languages on machines with SIMD instruction sets. Our main contribution is a language- and platform-independent code transformation that performs whole-function vectorization on low-level intermediate code given by a control flow graph in SSA form. We evaluate our technique in two scenarios: First, incorporated in a compiler for a domain-specific language used in realtime ray tracing. Second, in a stand-alone OpenCL driver. We observe average speedup factors of 3.9 for the ray tracer and factors between 0.6 and 5.2 for different OpenCL kernels.

Tags: Code generation, Computer science, OpenCL, Programming techniques

May 11, 2011 by hgpu

No votes yet.

Please wait...

high performance computing on graphics processing units: hgpu.org

Whole-function vectorization

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Whole-function vectorization

Share this:

Recent source codes

Most viewed papers (last 30 days)