high performance computing on graphics processing units: hgpu.org

Posts

Aug, 19

libhclooc: Software Library Facilitating Out-of-core Implementations of Accelerator Kernels on Hybrid Computing Platforms

Hardware accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors (PHIs), and Field-Programmable Gate Arrays (FPGAs) are now ubiquitous in extreme-scale high performance computing (HPC), cloud, and Big data platforms to facilitate execution of workloads that demand high energy efficiency. They present unique interfaces and programming models therefore posing several limitations, which must […]

CUDA

•

OpenCL

Aug, 11

A Cross-platform Evaluation of Graphics Shader Compiler Optimization

For real-time graphics applications such as games and virtual reality, performance is crucial to provide a smooth user experience. Central to this is the performance of shader programs which render images on the GPU. The rise of low-level graphics APIs such as Vulkan means compilation tools play an increasingly important role in the graphics ecosystem. […]

Aug, 11

GPU parallelization of a hybrid pseudospectral fluid turbulence framework using CUDA

An existing hybrid MPI-OpenMP scheme is augmented with a CUDA-based fine grain parallelization approach for multidimensional distributed Fourier transforms, in a well-characterized pseudospectral fluid turbulence code. Basics of the hybrid scheme are reviewed, and heuristics provided to show a potential benefit of the CUDA implementation. The method draws heavily on the CUDA runtime library to […]

CUDA

Aug, 11

A Case Study in Using OpenCL on FPGAs: Creating an Open-Source Accelerator of the AutoDock Molecular Docking Software

In recent years, OpenCL has been increasingly adopted as it enables software programmers to harness the performance and power efficiency of FPGAs. Despite simplifying the FPGA programming challenge, achieving high performance and energy efficiency with OpenCL is still a difficult task. In order to further contribute to the advance of the OpenCL usage for FPGAs, […]

OpenCL

Aug, 11

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

Recent work has shown how to train Convolutional Neural Networks (CNNs) rapidly on large image datasets, then transfer the knowledge gained from these models to a variety of tasks. Following [Radford 2017], in this work, we demonstrate similar scalability and transfer for Recurrent Neural Networks (RNNs) for Natural Language tasks. By utilizing mixed precision arithmetic […]

CUDA

Aug, 11

Parallax: Automatic Data-Parallel Training of Deep Neural Networks

The employment of high-performance servers and GPU accelerators for training deep neural network models have greatly accelerated recent advances in machine learning (ML). ML frameworks, such as TensorFlow, MXNet, and Caffe2, have emerged to assist ML researchers to train their models in a distributed fashion. However, correctly and efficiently utilizing multiple machines and GPUs is […]

CUDA

Aug, 5

Scheduling Computation Graphs of Deep Learning Models on Manycore CPUs

For a deep learning model, efficient execution of its computation graph is key to achieving high performance. Previous work has focused on improving the performance for individual nodes of the computation graph, while ignoring the parallelization of the graph as a whole. However, we observe that running multiple operations simultaneously without interference is critical to […]

Aug, 5

Energy-based Tuning of Convolutional Neural Networks on Multi-GPUs

Deep Learning (DL) applications are gaining momentum in the realm of Artificial Intelligence, particularly after GPUs have demonstrated remarkable skills for accelerating their challenging computational requirements. Within this context, Convolutional Neural Network (CNN) models constitute a representative example of success on a wide set of complex applications, particularly on datasets where the target can be […]

CUDA

Aug, 5

GPU schedulers: how fair is fair enough?

Blocking synchronisation idioms, e.g. mutexes and barriers, play an important role in concurrent programming. However, systems with semi-fair schedulers, e.g. graphics processing units (GPUs), are becoming increasingly common. Such schedulers provide varying degrees of fairness, guaranteeing enough to allow some, but not all, blocking idioms. While a number of applications that use blocking idioms do […]

OpenCL

Aug, 5

OpenCLIPER: an OpenCL-based C++ Framework for Overhead-Reduced Medical Image Processing and Reconstruction on Heterogeneous Devices

Medical image processing is often limited by the computational cost of the involved algorithms. Whereas dedicated computing devices (GPUs in particular) exist and do provide significant efficiency boosts, they have an extra cost of use in terms of housekeeping tasks (device selection and initialization, data streaming, synchronization with the CPU and others), which may hinder […]

OpenCL

Aug, 5

CRUM: Checkpoint-Restart Support for CUDA’s Unified Memory

Unified Virtual Memory (UVM) was recently introduced on recent NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs […]

CUDA

Jul, 28

Elementary functions: towards automatically generated, efficient, and vectorizable implementations

Elementary mathematical functions are pervasive in many high performance computing programs. However, although the mathematical libraries (libms), on which these programs rely, generally provide several flavors of the same function, these are fixed at implementation time. Hence this monolithic characteristic of libms is an obstacle for the performance of programs relying on them, because they […]

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

high performance computing on graphics processing units: hgpu.org

Posts

libhclooc: Software Library Facilitating Out-of-core Implementations of Accelerator Kernels on Hybrid Computing Platforms

A Cross-platform Evaluation of Graphics Shader Compiler Optimization

GPU parallelization of a hybrid pseudospectral fluid turbulence framework using CUDA

A Case Study in Using OpenCL on FPGAs: Creating an Open-Source Accelerator of the AutoDock Molecular Docking Software

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

Parallax: Automatic Data-Parallel Training of Deep Neural Networks

Scheduling Computation Graphs of Deep Learning Models on Manycore CPUs

Energy-based Tuning of Convolutional Neural Networks on Multi-GPUs

GPU schedulers: how fair is fair enough?

OpenCLIPER: an OpenCL-based C++ Framework for Overhead-Reduced Medical Image Processing and Reconstruction on Heterogeneous Devices

CRUM: Checkpoint-Restart Support for CUDA’s Unified Memory

Elementary functions: towards automatically generated, efficient, and vectorizable implementations

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)