high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Optimising Purely Functional GPU Programs (Thesis)

Optimising Purely Functional GPU Programs (Thesis)

Trevor L. McDonell

School of Computer Science and Engineering, University of New South Wales

University of New South Wales, 2014

BibTeX

Download (PDF)

View

Source

Source codes

Package:

AccelerateHS

2344

views

It is well acknowledged that the dominant mechanism for scaling processor performance has become to increase the number of cores on a chip, rather than improve the performance of a single core. However, harnessing these extra cores to improve single application performance remains an extremely challenging task. A recent trend has been to use commodity graphics processing units to explore new algorithms and programming languages that can address the challenges of parallelism and scalability for devices containing hundreds or thousands of cores. The research documented in this dissertation builds upon the Accelerate language, an embedded domain specific language of purely functional collective operations over dense, multidimensional arrays. Overall, this dissertation explains how to efficiently implement such a language, ranging from optimisations to the input program and the generation of efficient target code, to optimisations in the runtime system which executes the resulting programs. In particular, I add backend-agnostic optimisations to the embedded array language, primarily in the form of a novel approach to array fusion. These optimisations are completely type preserving, amounting to a partial proof of correctness of the optimisations, checked by the type checker at compilation time. In addition, I develop the CUDA backend to Accelerate, which targets parallel execution on multicore graphics processing units (GPUs). I implement a dynamic quasi quotation-based code generation system for algorithmic skeletons that are instantiated at runtime. In order to minimise the overheads of runtime compilation, generated codes are compiled in parallel and asynchronously with other operations. Here, compiled binaries are cached for later reuse, including across program invocations. As a significant overhead in GPU programs is the transfer of data to and from the device, a garbage collection-based memory management system was implemented which minimises data transfers. Finally, I implement an execution engine that optimises the loading, configuration, and concurrent execution of the compiled GPU kernels. My results show that with these additions, programs written in Accelerate can be competitive in performance to programs written in traditional lower-level languages, while enjoying a much higher level of abstraction. Through the work done in this thesis, Accelerate has become the dominant method for GPGPU programming in Haskell.

Tags: Code generation, Computer science, CUDA, nVidia, Package, Performance, Tesla T10, Thesis

August 10, 2014 by hgpu

No votes yet.

Please wait...

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

high performance computing on graphics processing units: hgpu.org

Optimising Purely Functional GPU Programs (Thesis)

Package:

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Optimising Purely Functional GPU Programs (Thesis)

Package:

Share this:

Recent source codes

Most viewed papers (last 30 days)