high performance computing on graphics processing units: hgpu.org

Posts

May, 9

Performance Evaluation and Improvements of the PoCL Open-Source OpenCL Implementation on Intel CPUs

The Portable Computing Language (PoCL) is a vendor-independent open-source OpenCL implementation that aims to support a variety of compute devices in a single platform. Evaluating PoCL versus the Intel OpenCL implementation reveals significant performance drawbacks of PoCL on Intel CPUs – which run 92% of the TOP500 list. Using a selection of benchmarks, […]

OpenCL

May, 9

Sylkan: Towards a Vulkan Compute Target Platform for SYCL

SYCL is a modern high-level C++ programming interface which excels at expressing data parallelism for heterogeneous hardware platforms in a programmer-friendly way, and is standardized by the Khronos Group. The latest version of the standard, SYCL 2020, removes the previous dependence of the specification and its implementations on an underlying OpenCL target, opening the door […]

OpenCL

May, 9

A fluid simulation system based on the MPS method

Fluid flow simulation is a highly active area with applications in a wide range of engineering problems and interactive systems. Meshless methods like the Moving Particle Semi-implicit (MPS) are a great alternative to deal efficiently with large deformations and free-surface flow. However, mesh-based approaches can achieve higher numerical precision than particle-based techniques with a performance […]

CUDA

May, 9

Irregularity Mitigation and Portability Abstractions for Accelerated Sparse Matrix Factorization

In this thesis, we investigate new ways to mitigate the inherent irregularity in sparse matrix factorizations and decompose the resulting computation into simple kernels which are portable across a diverse set of compute accelerator architectures through our novel compiler borG. Be it weather prediction, climate models, personalized medicine, genetic analysis or autonomous driving: some of […]

OpenCL

May, 9

Efficacy of Images Versus Data Buffers: Optimizing Interactive Applications Utilizing OpenCL for Scientific Visualization

This paper examines an algorithm using dual OpenCL image buffers to optimize data streaming for ensemble processing and visualization. Image buffers were utilized because they allow cached memory access, unlike simple data buffers, which are more commonly used. OpenCL image object performance was improved by allowing upload and mapping into one buffer to occur concurrently […]

OpenCL

May, 2

DeepfakeUCL: Deepfake Detection via Unsupervised Contrastive Learning

Face deepfake detection has seen impressive results recently. Nearly all existing deep learning techniques for face deepfake detection are fully supervised and require labels during training. In this paper, we design a novel deepfake detection method via unsupervised contrastive learning. We first generate two different transformed versions of an image and feed them into two […]

CUDA

May, 2

Enabling Energy-Efficient DNN Training on Hybrid GPU-FPGA Accelerators

DNN training consumes orders of magnitude more energy than inference and requires innovative use of accelerators to improve energy-efficiency. However, despite having complementary features, GPUs and FPGAs have been mostly used independently for the entire training process, thus neglecting the opportunity in assigning individual but distinct operations to the most suitable hardware. In this paper, […]

OpenCL

May, 2

Performance analysis and optimization of highly diverging algorithms on GPUs

In this thesis, the performance of the IceCube project's photon propagation code (clsim) is optimized. The process of GPU code analysis and performance optimization is described in detail. When run on the same hardware, the new version achieves a speedup of about 3x over the original implementation. Comparing the unmodified code on hardware currently used […]

CUDA

May, 2

Easy and Efficient Transformer: Scalable Inference Solution For large NLP model

Ultra-large-scale pre-trained models can effectively improve results on a variety of tasks, but they also impose a heavy computational burden on inference. This paper introduces a series of optimization methods for ultra-large-scale pre-trained models that combine algorithm characteristics with GPU hardware characteristics and, on this basis, proposes an inference engine: Easy and […]

CUDA

May, 2

tcFFT: Accelerating Half-Precision FFT through Tensor Cores

Fast Fourier Transform (FFT) is an essential tool in scientific and engineering computation. The increasing demand for mixed-precision FFT has made it possible to utilize half-precision floating-point (FP16) arithmetic for faster speed and energy saving. Specializing in lower precision, NVIDIA Tensor Cores can deliver extremely high computation performance. However, the fixed computation pattern makes it […]

Apr, 25

How to Train BERT with an Academic Budget

GPUs are now used for a wide range of problems within HPC. However, making efficient use of the computational power available with multiple GPUs is challenging. The main challenges in achieving good performance are memory layout, affecting memory bandwidth, effective use of the memory spaces with a GPU, inter-GPU communication, and synchronization. We address these […]

Apr, 25

Deep Graph Learning for Program Analysis and System Optimization

It has been increasingly challenging for the compilers to cope with the evolving computer architectures. The manually written compiler heuristics are not sufficiently wise to capture the impact of data and hardware related dependencies on performance. However, machine learning offers an opportunity to learn the common patterns in the existing dataset and predict the future […]

OpenCL

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)