
Posts

Jan, 16

Dopia: Online Parallelism Management for Integrated CPU/GPU Architectures

Recent desktop and mobile processors often integrate the CPU and GPU onto the same die. The limited memory bandwidth of these integrated architectures can negatively affect the performance of data-parallel workloads when all computational resources are active. The combination of active CPU and GPU cores that achieves maximum performance depends on a workload’s characteristics, making manual […]
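
The right degree of parallelism can be searched for at run time rather than fixed by hand. The sketch below is not Dopia's algorithm; it is a minimal, hypothetical C++ illustration of picking the best number of active CPU worker threads for a memory-bound kernel by timing short probe runs, the kind of online decision the abstract refers to.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Memory-bound toy kernel: scale a chunk of a large array (hypothetical workload).
    static void scale_chunk(std::vector<float>& data, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) data[i] *= 1.0001f;
    }

    // Run the kernel with a given number of worker threads and return elapsed seconds.
    static double run_with_threads(std::vector<float>& data, unsigned threads) {
        const auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        const std::size_t chunk = data.size() / threads;
        for (unsigned t = 0; t < threads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = (t + 1 == threads) ? data.size() : begin + chunk;
            pool.emplace_back(scale_chunk, std::ref(data), begin, end);
        }
        for (auto& th : pool) th.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        std::vector<float> data(1 << 24, 1.0f);
        unsigned best = 1;
        double best_time = 1e9;
        // Online probing: try increasing thread counts; memory-bound workloads often
        // stop scaling (or even slow down) once the memory bandwidth is saturated.
        for (unsigned t = 1; t <= std::thread::hardware_concurrency(); t *= 2) {
            double s = run_with_threads(data, t);
            std::printf("%u threads: %.3f s\n", t, s);
            if (s < best_time) { best_time = s; best = t; }
        }
        std::printf("best degree of parallelism for this workload: %u threads\n", best);
    }
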
Jan, 2

System-Level Optimization and Code Generation for Graphics Processors using a Domain-Specific Language

As graphics processing units (GPUs) are increasingly being used for general-purpose processing, efficient tooling for programming such parallel architectures becomes essential. Despite continuous efforts to improve the programmability of CUDA and OpenCL, they remain relatively low-level languages and require in-depth architectural knowledge to achieve high-performance implementations. Developers have to perform memory management manually to […]
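
As a concrete illustration of the manual memory management the abstract refers to, the sketch below spells out the explicit buffer allocation, host-to-device copies, kernel-argument setup, launch, and read-back that an OpenCL host program requires and that a higher-level DSL could generate automatically. It is an illustrative sketch, not code from the paper, and error handling is omitted for brevity.

    // Minimal OpenCL host program for a vector add, showing the explicit
    // memory-management steps a DSL could hide. Illustrative sketch only.
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    static const char* kSrc =
        "__kernel void vadd(__global const float* a, __global const float* b,\n"
        "                   __global float* c) {\n"
        "  size_t i = get_global_id(0);\n"
        "  c[i] = a[i] + b[i];\n"
        "}\n";

    int main() {
        const size_t n = 1 << 20;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        // Boilerplate: pick the first platform/device and build the kernel.
        cl_platform_id platform; clGetPlatformIDs(1, &platform, nullptr);
        cl_device_id device; clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);
        cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
        clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
        cl_kernel vadd = clCreateKernel(prog, "vadd", nullptr);

        // Manual memory management: allocate device buffers and copy inputs over.
        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY, n * sizeof(float), nullptr, nullptr);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY, n * sizeof(float), nullptr, nullptr);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, nullptr);
        clEnqueueWriteBuffer(q, da, CL_TRUE, 0, n * sizeof(float), a.data(), 0, nullptr, nullptr);
        clEnqueueWriteBuffer(q, db, CL_TRUE, 0, n * sizeof(float), b.data(), 0, nullptr, nullptr);

        // Bind arguments and launch one work-item per element.
        clSetKernelArg(vadd, 0, sizeof(cl_mem), &da);
        clSetKernelArg(vadd, 1, sizeof(cl_mem), &db);
        clSetKernelArg(vadd, 2, sizeof(cl_mem), &dc);
        size_t global = n;
        clEnqueueNDRangeKernel(q, vadd, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);

        // Copy the result back and release everything explicitly.
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, n * sizeof(float), c.data(), 0, nullptr, nullptr);
        std::printf("c[0] = %f\n", c[0]);
        clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
        clReleaseKernel(vadd); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
    }
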
Dec, 12

CitiusSynapse: A Deep Learning Framework for Embedded Systems

As embedded systems, such as smartphones with limited resources, have become increasingly popular, active research has recently been conducted on performing on-device deep learning in such systems. Therefore, in this study, we propose a deep learning framework that is specialized for embedded systems with limited resources, the operation processing structure of which differs from that […]
Dec, 12

High performance computing on Android devices – a case study

High-performance computing on low-power devices can speed up calculations on processors that run at a lower clock rate than computers for which energy efficiency is not a concern. In this case study, different high-performance techniques for Android devices are compared, with a special focus on the use of the GPU. […]
Dec, 12

GPU backed Data Mining on Android Devices

Choosing an appropriate programming paradigm for high-performance computing on low-power devices can help speed up calculations. Many Android devices have an integrated GPU and, although not officially supported, the OpenCL framework can be used on Android to address these GPUs. OpenCL supports thread and data parallelism. Applications that use the […]
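
As a reminder of what OpenCL's data parallelism looks like, the snippet below shows a typical data-parallel kernel in which each work-item handles one element; the host-side setup is the same on Android as on desktop OpenCL. The distance computation here is a hypothetical data-mining building block, not the paper's code.

    // A data-parallel OpenCL kernel, stored as a C++ string for clBuildProgram.
    // Each work-item computes the squared Euclidean distance of one point to a
    // centroid -- a typical building block of data-mining kernels (hypothetical).
    static const char* kDistanceKernel = R"CLC(
    __kernel void sq_dist(__global const float* points,   // n * dim, row-major
                          __global const float* centroid, // dim
                          __global float* out,            // n
                          const int dim) {
        size_t i = get_global_id(0);        // one work-item per point
        float acc = 0.0f;
        for (int d = 0; d < dim; ++d) {
            float diff = points[i * dim + d] - centroid[d];
            acc += diff * diff;
        }
        out[i] = acc;
    }
    )CLC";
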
Nov, 14

Performance Optimisations for Heterogeneous Managed Runtime Systems

High demand for increased computational capabilities and power efficiency has led commodity devices to integrate diverse hardware resources. Desktops, laptops, and smartphones have embraced heterogeneity through multi-core Central Processing Units (CPUs), energy-efficient integrated Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), powerful discrete GPUs, and Tensor Processing Units (TPUs). To ease the programmability of […]
Oct, 17

Accelerating AutoDock VINA with GPUs

AutoDock VINA is one of the most widely used docking tools in the early stages of modern drug discovery. It uses a Monte-Carlo-based iterated search method and a multithreading parallelism scheme on multicore machines to improve docking accuracy and speed. However, virtual screening of huge compound databases is common in modern drug discovery, which puts forward a […]
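
The "Monte-Carlo based iterated search" mentioned above boils down to repeatedly perturbing a candidate solution and keeping improvements, with independent searches run in parallel. The sketch below is a generic, hypothetical C++ version of that idea using a toy scoring function and std::thread; it is not VINA's actual scoring function or search code.

    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <thread>
    #include <vector>

    // Toy "scoring function": lower is better (stands in for a docking score).
    static double score(const std::vector<double>& x) {
        double s = 0.0;
        for (double v : x) s += (v - 1.0) * (v - 1.0);
        return s;
    }

    // One Monte-Carlo iterated search: random start, then perturb-and-accept-if-better.
    static double mc_search(unsigned seed, int iters, std::vector<double>& best_x) {
        std::mt19937 rng(seed);
        std::uniform_real_distribution<double> init(-5.0, 5.0);
        std::normal_distribution<double> step(0.0, 0.1);
        std::vector<double> x(6);
        for (double& v : x) v = init(rng);
        double best = score(x);
        for (int i = 0; i < iters; ++i) {
            std::vector<double> cand = x;
            for (double& v : cand) v += step(rng);   // random perturbation
            double s = score(cand);
            if (s < best) { best = s; x = cand; }    // greedy acceptance
        }
        best_x = x;
        return best;
    }

    int main() {
        const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<double> results(n_threads);
        std::vector<std::vector<double>> poses(n_threads);
        std::vector<std::thread> pool;
        // Multithreading scheme: independent searches from different random seeds.
        for (unsigned t = 0; t < n_threads; ++t)
            pool.emplace_back([&, t] { results[t] = mc_search(t + 1, 100000, poses[t]); });
        for (auto& th : pool) th.join();
        std::printf("best score: %f\n", *std::min_element(results.begin(), results.end()));
    }
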
Oct, 3

HLS Portability from Intel to Xilinx: A Case Study

Field-programmable gate arrays (FPGAs) are a hardware accelerator option that is growing in popularity. However, FPGAs are notoriously hard to program. To this end, high-level synthesis (HLS) tools have been developed to allow programmers to design hardware accelerators with FPGAs using familiar software languages. The two largest FPGA vendors, Intel and Xilinx, support both C/C++ […]
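
To make the portability question concrete, the sketch below shows roughly what C++ for HLS looks like: an ordinary function plus pragmas that steer how the tool maps loops onto hardware. The pragma spelling shown is Xilinx (Vitis HLS) style; Intel's HLS compiler expresses similar intent with its own pragma and attribute syntax, which is exactly the kind of difference a porting effort has to bridge. This is an illustrative sketch, not code from the case study.

    // A toy HLS kernel in C++: the loop pragma tells the synthesis tool how to
    // schedule the loop in hardware. Xilinx Vitis HLS pragma style shown here;
    // an Intel HLS build would need equivalent but differently spelled directives.
    extern "C" void vadd(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1   // aim to start a new loop iteration every clock cycle
            c[i] = a[i] + b[i];
        }
    }
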
Oct, 3

Embedded Software Synthesis using Heterogeneous Dataflow Models

Dataflow process networks (DPNs) consist of statically defined process nodes with First-In-First-Out (FIFO) buffered point-to-point connections. DPNs are intrinsically data-driven, i.e., node actions are not synchronized with each other and may fire whenever sufficient input operands have arrived at a node. In this original form, DPNs are data-driven and therefore a suitable model of computation (MoC) […]
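
A minimal way to picture a DPN node: it reads tokens from FIFO-buffered input channels and fires as soon as enough operands are available, independent of any global schedule. The C++ sketch below models this with std::queue FIFOs and a single adder node; it is a toy illustration of the firing rule, not the synthesis approach of the paper.

    #include <cstdio>
    #include <queue>

    // FIFO-buffered point-to-point channel between two DPN nodes.
    template <typename T>
    using Fifo = std::queue<T>;

    // An adder node: fires whenever both input FIFOs hold at least one token,
    // consuming one token from each and producing one output token (data-driven,
    // with no synchronization against other nodes).
    static bool fire_adder(Fifo<int>& in_a, Fifo<int>& in_b, Fifo<int>& out) {
        if (in_a.empty() || in_b.empty()) return false;  // not enough operands yet
        int a = in_a.front(); in_a.pop();
        int b = in_b.front(); in_b.pop();
        out.push(a + b);
        return true;
    }

    int main() {
        Fifo<int> a, b, sum;
        a.push(1); a.push(2); a.push(3);
        b.push(10); b.push(20);             // deliberately fewer tokens on this input
        while (fire_adder(a, b, sum)) {}    // keep firing while operands are available
        while (!sum.empty()) { std::printf("%d\n", sum.front()); sum.pop(); }  // 11, 22
    }
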
Jul, 18

Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis

The importance of security infrastructures for high-throughput networks has rapidly grown as a result of expanding internet traffic and increasingly high-bandwidth connections. Intrusion-detection systems (IDSs), such as SNORT, rely upon rule sets designed to alert system administrators of malicious packets. Methods for deep-packet inspection, which often depend upon regular-expression searches, can be accelerated on programmable-logic […]
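
At its core, rule-based deep-packet inspection means running many regular expressions over every payload. The C++ sketch below, using std::regex and a few made-up rules, shows the software baseline that FPGA regex engines aim to accelerate; it is not SNORT's rule engine or the paper's hardware design.

    #include <cstdio>
    #include <regex>
    #include <string>
    #include <vector>

    // A few made-up inspection rules: pattern + alert message (illustrative only).
    struct Rule { std::regex pattern; const char* alert; };

    int main() {
        std::vector<Rule> rules = {
            { std::regex(R"(\.\./\.\./)"),              "possible path traversal" },
            { std::regex(R"(SELECT\s+.*\s+FROM)",
                         std::regex::icase),            "possible SQL injection" },
            { std::regex(R"(\x90{8,})"),                "long NOP sled" },
        };

        // Payloads stand in for reassembled packet contents.
        std::vector<std::string> payloads = {
            "GET /../../etc/passwd HTTP/1.1",
            "user=alice&query=select name from users",
            "hello world",
        };

        // Deep-packet inspection baseline: match every rule against every payload.
        for (const auto& p : payloads)
            for (const auto& r : rules)
                if (std::regex_search(p, r.pattern))
                    std::printf("ALERT: %s in \"%s\"\n", r.alert, p.c_str());
    }
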
Jul, 18

Optimisation and GPU code generation of Stencils for Futhark

Stencil computations are common in scientific computing. Exploiting parallelism is central to achieving faster execution times for stencils that run over large amounts of data, which makes them well suited to a GPGPU setting. However, programming stencils to run on massively parallel hardware […]
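
For readers unfamiliar with the pattern: a stencil updates every element of an array from a fixed neighbourhood of surrounding elements, so all outputs of one sweep can be computed independently, which is what makes GPUs a good fit. The sketch below is a plain sequential C++ version of a 1D 3-point stencil; the paper's implementations are written in Futhark, not C++.

    #include <cstdio>
    #include <vector>

    // One sweep of a 1D 3-point stencil: each output depends only on a fixed
    // neighbourhood of the input, so every iteration of the loop is independent
    // and could be mapped to one GPU thread per element.
    static void stencil_step(const std::vector<float>& in, std::vector<float>& out) {
        const std::size_t n = in.size();
        for (std::size_t i = 1; i + 1 < n; ++i)
            out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
        out[0] = in[0];            // keep the boundary values fixed
        out[n - 1] = in[n - 1];
    }

    int main() {
        std::vector<float> a(16, 0.0f), b(16, 0.0f);
        a[8] = 1.0f;                               // a single "hot" cell
        for (int step = 0; step < 4; ++step) {     // repeated sweeps diffuse it outwards
            stencil_step(a, b);
            std::swap(a, b);
        }
        for (float v : a) std::printf("%.3f ", v);
        std::printf("\n");
    }
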
Jul, 18

GPTPU: Accelerating Applications using Edge Tensor Processing Units

Neural network (NN) accelerators have been integrated into a wide spectrum of computer systems to accommodate the rapidly growing demands for artificial intelligence (AI) and machine learning (ML) applications. NN accelerators share the idea of providing native hardware support for operations on multidimensional tensor data. Therefore, NN accelerators are theoretically tensor processors that can improve system […]
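
The "operations on multidimensional tensor data" that such accelerators implement natively are, at their simplest, dense matrix multiplications like the plain C++ loop nest below; an edge TPU executes the same mathematical operation on (typically quantized) tensors in hardware. The code is a reference illustration, not GPTPU's programming interface.

    #include <cstdio>
    #include <vector>

    // Reference dense matrix multiply C = A * B (row-major), the canonical tensor
    // operation that NN accelerators such as edge TPUs support natively in hardware.
    static void matmul(const std::vector<float>& A, const std::vector<float>& B,
                       std::vector<float>& C, int m, int k, int n) {
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += A[i * k + p] * B[p * n + j];
                C[i * n + j] = acc;
            }
    }

    int main() {
        const int m = 2, k = 3, n = 2;
        std::vector<float> A = {1, 2, 3, 4, 5, 6};        // 2x3
        std::vector<float> B = {7, 8, 9, 10, 11, 12};     // 3x2
        std::vector<float> C(m * n, 0.0f);
        matmul(A, B, C, m, k, n);
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < n; ++j) std::printf("%6.1f ", C[i * n + j]);
            std::printf("\n");
        }
    }
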

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
