
Posts

May 8

cuPSO: GPU Parallelization for Particle Swarm Optimization Algorithms

Particle Swarm Optimization (PSO) is a stochastic technique for solving optimization problems. Attempts have been made to shorten the computation times of PSO-based algorithms with massive numbers of threads on GPUs (graphics processing units), where thread groups are formed to calculate the information of particles and the computed outputs for the particles are aggregated and […]
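
A minimal C++ sketch of the per-particle PSO update that such GPU ports parallelize, with one thread (or work-item) per particle; the coefficient values and the sphere objective below are illustrative assumptions, not cuPSO's implementation.

```cpp
// Per-particle PSO velocity/position update (illustrative sketch).
// On a GPU, the body of the per-particle loop becomes one thread's work,
// and the global-best update becomes a reduction.
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Particle {
    std::vector<double> x, v, best_x;
    double best_f;
};

// Illustrative objective: sphere function (minimum at the origin).
double objective(const std::vector<double>& x) {
    double s = 0.0;
    for (double xi : x) s += xi * xi;
    return s;
}

void pso_step(std::vector<Particle>& swarm, std::vector<double>& gbest_x,
              double& gbest_f) {
    const double w = 0.7, c1 = 1.5, c2 = 1.5;   // assumed coefficients
    for (auto& p : swarm) {                     // GPU: one thread per particle
        for (size_t d = 0; d < p.x.size(); ++d) {
            double r1 = std::rand() / (double)RAND_MAX;
            double r2 = std::rand() / (double)RAND_MAX;
            p.v[d] = w * p.v[d]
                   + c1 * r1 * (p.best_x[d] - p.x[d])
                   + c2 * r2 * (gbest_x[d] - p.x[d]);
            p.x[d] += p.v[d];
        }
        double f = objective(p.x);
        if (f < p.best_f) { p.best_f = f; p.best_x = p.x; }
        if (f < gbest_f)  { gbest_f = f;  gbest_x = p.x; }   // reduction on GPU
    }
}

int main() {
    const int n_particles = 32, dim = 4;
    std::vector<Particle> swarm(n_particles);
    std::vector<double> gbest_x(dim, 0.0);
    double gbest_f = 1e30;
    for (auto& p : swarm) {
        p.x.assign(dim, 0.0);
        p.v.assign(dim, 0.0);
        for (auto& xi : p.x) xi = std::rand() / (double)RAND_MAX - 0.5;
        p.best_x = p.x;
        p.best_f = objective(p.x);
        if (p.best_f < gbest_f) { gbest_f = p.best_f; gbest_x = p.x; }
    }
    for (int it = 0; it < 100; ++it) pso_step(swarm, gbest_x, gbest_f);
    std::printf("best objective: %g\n", gbest_f);
}
```
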
May 8

GPUNet: Searching the Deployable Convolution Neural Networks for GPUs

Customizing Convolution Neural Networks (CNN) for production use has been a challenging task for DL practitioners. This paper intends to expedite the model customization with a model hub that contains the optimized models tiered by their inference latency using Neural Architecture Search (NAS). To achieve this goal, we build a distributed NAS system to search […]
May 8

Analytical Performance Estimation during Code Generation on Modern GPUs

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. We propose an alternative to time-intensive autotuning, scenario-specific performance models, or black-box machine learning to select the best-performing configuration. This […]
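
As a rough illustration of what an analytical estimate can look like (a generic roofline-style bound, not the paper's model), the sketch below ranks two hypothetical kernel variants by predicted runtime using assumed bandwidth and FLOP-rate parameters.

```cpp
// Roofline-style analytical estimate (illustrative only): predicted kernel
// time is bounded by either memory traffic or arithmetic throughput.
#include <algorithm>
#include <cstdio>

struct KernelVariant {
    const char* name;
    double bytes_moved;   // estimated DRAM traffic in bytes
    double flops;         // estimated floating-point operations
};

// Assumed machine parameters for a hypothetical GPU.
constexpr double kPeakBandwidth = 800e9;  // bytes/s
constexpr double kPeakFlops     = 10e12;  // FLOP/s

double predicted_time(const KernelVariant& k) {
    return std::max(k.bytes_moved / kPeakBandwidth, k.flops / kPeakFlops);
}

int main() {
    // Two hypothetical code-generation choices for the same computation.
    KernelVariant variants[] = {
        {"naive", 4.0e9, 2.0e9},
        {"tiled", 1.5e9, 2.0e9},   // tiling reduces memory traffic
    };
    const KernelVariant* best = &variants[0];
    for (const auto& v : variants) {
        std::printf("%-6s predicted %.3f ms\n", v.name, predicted_time(v) * 1e3);
        if (predicted_time(v) < predicted_time(*best)) best = &v;
    }
    std::printf("selected: %s\n", best->name);
}
```
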
May 1

The Celerity High-level API: C++20 for Accelerator Clusters

Providing convenient APIs and notations for data parallelism which remain accessible for programmers while still providing good performance has been a long-term goal of researchers as well as language and library designers. C++20 introduces ranges and views, as well as the composition of operations on them using a concise syntax, but the efficient implementation of […]
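
For readers unfamiliar with the C++20 feature the abstract refers to, here is a small standard-C++ example of composing ranges and views with the pipe syntax; Celerity's own API is not shown here.

```cpp
// C++20 ranges and views: composing operations as a concise, lazy pipeline.
#include <iostream>
#include <ranges>
#include <vector>

int main() {
    std::vector<int> data{1, 2, 3, 4, 5, 6, 7, 8};

    // Lazily composed view: keep even values, then square them.
    auto pipeline = data
        | std::views::filter([](int x) { return x % 2 == 0; })
        | std::views::transform([](int x) { return x * x; });

    for (int v : pipeline) std::cout << v << ' ';   // prints: 4 16 36 64
    std::cout << '\n';
}
```
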
May 1

CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems

Modern computing platforms tend to deploy multiple GPUs on a single node to boost performance. GPUs have large computing capacities and are an expensive resource. Increasing their utilization without causing performance degradation of individual workloads is an important and challenging problem. Although services such as NVIDIA’s MPS allow multiple cooperative kernels to simultaneously run on […]
May 1

End-to-end Mapping in Heterogeneous Systems Using Graph Representation Learning

To enable heterogeneous computing systems with autonomous programming and optimization capabilities, we propose a unified, end-to-end, programmable graph representation learning (PGL) framework that is capable of mining the complexity of high-level programs down to the universal intermediate representation, extracting the specific computational patterns and predicting which code segments would run best on a specific core […]
May 1

Improving performance of SYCL applications on CPU architectures using LLVM-directed compilation flow

The wide adoption of SYCL as an open-standard API for accelerating C++ software in domains such as HPC, Automotive, Artificial Intelligence, Machine Learning, and other areas necessitates efficient compiler and runtime support for a growing number of different platforms. Existing SYCL implementations provide support for various devices like CPUs, GPUs, DSPs, FPGAs, etc., typically via […]
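
For context, a minimal vector addition in standard SYCL 2020 (a generic example, unrelated to the paper's LLVM-directed compilation flow); it requires a SYCL compiler such as DPC++ or AdaptiveCpp, and the default device selection shown here is an assumption.

```cpp
// Minimal SYCL 2020 vector addition using buffers and accessors.
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    sycl::queue q;  // default device: may resolve to a CPU or GPU backend
    {
        sycl::buffer<float> ba(a.data(), sycl::range<1>(n));
        sycl::buffer<float> bb(b.data(), sycl::range<1>(n));
        sycl::buffer<float> bc(c.data(), sycl::range<1>(n));

        q.submit([&](sycl::handler& h) {
            sycl::accessor A(ba, h, sycl::read_only);
            sycl::accessor B(bb, h, sycl::read_only);
            sycl::accessor C(bc, h, sycl::write_only, sycl::no_init);
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    }   // buffers go out of scope: results are copied back to the host vectors

    std::cout << "c[0] = " << c[0] << '\n';   // expect 3
}
```
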
May 1

Efficient Execution of OpenMP on GPUs

OpenMP is the preferred choice for CPU parallelism in High-Performance Computing (HPC) applications written in C, C++, or Fortran. As HPC systems became heterogeneous, OpenMP introduced support for accelerator offloading via the target directive. This allowed porting existing (CPU) code onto GPUs, including well-established CPU parallelism paradigms. However, there are architectural differences between CPU and […]
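
A minimal, generic example of the target directive mentioned in the abstract (not the paper's benchmarks): a SAXPY-style loop offloaded with OpenMP, which falls back to the host when no device is available or offloading is not enabled at compile time.

```cpp
// OpenMP target offloading: the 'target' directive moves a region to an
// accelerator; 'teams distribute parallel for' spreads the loop over it.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float* px = x.data();
    float* py = y.data();
    const float a = 0.5f;

    // Map the arrays to the device, run the loop across GPU teams/threads,
    // and copy y back when the region ends.
    #pragma omp target teams distribute parallel for \
        map(to: px[0:n]) map(tofrom: py[0:n])
    for (int i = 0; i < n; ++i)
        py[i] = a * px[i] + py[i];

    std::printf("y[0] = %f\n", y[0]);   // expect 2.5
}
```
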
Apr 17

Performance Comparison of Different OpenCL Implementations of LBM Simulation on Commodity Computer Hardware

Parallel programming is increasingly used to improve the performance of solving numerical methods used for scientific purposes. Numerical methods in the field of fluid dynamics require the calculation of a large number of operations per second. One of the methods that is easily parallelized and often used is the Lattice Boltzmann method (LBM). Today, it […]
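
To make the "easily parallelized" point concrete, here is a textbook single-cell D2Q9 BGK collision step in plain C++; in an OpenCL implementation each lattice site would typically map to one work-item. This is a generic sketch, not any of the paper's specific kernels.

```cpp
// Single-cell D2Q9 BGK collision step: the per-lattice-site work of LBM.
#include <cstdio>

constexpr int    Q     = 9;
constexpr double w[Q]  = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                          1.0/36, 1.0/36, 1.0/36, 1.0/36};
constexpr int    cx[Q] = {0, 1, 0, -1,  0, 1, -1, -1,  1};
constexpr int    cy[Q] = {0, 0, 1,  0, -1, 1,  1, -1, -1};

void collide(double f[Q], double tau) {
    // Macroscopic density and velocity from the distribution functions.
    double rho = 0, ux = 0, uy = 0;
    for (int i = 0; i < Q; ++i) {
        rho += f[i];
        ux  += cx[i] * f[i];
        uy  += cy[i] * f[i];
    }
    ux /= rho;
    uy /= rho;

    // BGK relaxation toward the local equilibrium distribution.
    double usq = ux * ux + uy * uy;
    for (int i = 0; i < Q; ++i) {
        double cu  = cx[i] * ux + cy[i] * uy;
        double feq = w[i] * rho * (1 + 3 * cu + 4.5 * cu * cu - 1.5 * usq);
        f[i] += (feq - f[i]) / tau;
    }
}

int main() {
    double f[Q];
    for (int i = 0; i < Q; ++i) f[i] = w[i];   // fluid at rest, rho = 1
    f[1] += 0.01;                              // small perturbation
    collide(f, 0.6);
    std::printf("f[0] after collision: %f\n", f[0]);
}
```
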
Apr 17

Explicit caching HYB: a new high-performance SpMV framework on GPGPU

Sparse Matrix-Vector Multiplication (SpMV) is a critical operation for the iterative solvers of Finite Element Methods in computer simulation. Since SpMV is a memory-bound algorithm, the efficiency of data movement heavily influences its performance on GPUs. In recent years, much research has been conducted on accelerating the performance of SpMV on […]
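
A sequential C++ sketch of SpMV over a HYB matrix (an ELL slab plus a COO overflow part), the format family the title refers to; the layout details below follow a common textbook arrangement, not necessarily the paper's explicit-caching variant.

```cpp
// SpMV with the HYB format: ELL for rows with few nonzeros, COO for the rest.
// Sequential sketch; on a GPU the ELL loop is typically one thread per row
// and the COO tail uses atomic adds.
#include <cstdio>
#include <vector>

struct HybMatrix {
    int n_rows = 0, ell_width = 0;
    // ELL: column-major n_rows x ell_width slab, padded with col = -1.
    std::vector<int>    ell_cols;
    std::vector<double> ell_vals;
    // COO: leftover entries that did not fit into the ELL width.
    std::vector<int>    coo_rows, coo_cols;
    std::vector<double> coo_vals;
};

void spmv_hyb(const HybMatrix& A, const std::vector<double>& x,
              std::vector<double>& y) {
    // ELL part: regular, coalesced accesses, ideal for one GPU thread per row.
    for (int r = 0; r < A.n_rows; ++r) {
        double sum = 0.0;
        for (int k = 0; k < A.ell_width; ++k) {
            int c = A.ell_cols[k * A.n_rows + r];
            if (c >= 0) sum += A.ell_vals[k * A.n_rows + r] * x[c];
        }
        y[r] = sum;
    }
    // COO part: irregular tail of long rows.
    for (size_t i = 0; i < A.coo_vals.size(); ++i)
        y[A.coo_rows[i]] += A.coo_vals[i] * x[A.coo_cols[i]];
}

int main() {
    // 3x3 matrix [[4,1,0],[0,3,0],[2,0,5]] with ELL width 1; extras in COO.
    HybMatrix A;
    A.n_rows = 3; A.ell_width = 1;
    A.ell_cols = {0, 1, 0};
    A.ell_vals = {4, 3, 2};
    A.coo_rows = {0, 2};  A.coo_cols = {1, 2};  A.coo_vals = {1, 5};
    std::vector<double> x = {1, 1, 1}, y(3, 0.0);
    spmv_hyb(A, x, y);
    std::printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);   // expect [5, 3, 7]
}
```
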
Apr 17

Performance study on GPU offloading techniques using the Gauss matrix inverse algorithm

Inverting matrices is a crucial part of many algorithms in linear algebra, computer graphics, and data analysis. There are many libraries providing algorithms to achieve this, but none that allow calling from the GPU context. GPUs and accelerators are becoming more and more prevalent in high-performance computers. Having no ready-to-use implementation, scientists need to […]
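
A plain host-side Gauss-Jordan inversion sketch with partial pivoting, the kind of kernel such a study offloads; the offloading itself (e.g. via OpenMP target or similar) is not shown, and the singularity tolerance and example matrix are illustrative.

```cpp
// Gauss-Jordan inversion of an augmented [A | I] system with partial pivoting.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Inverts an n x n row-major matrix a into inv; returns false if singular.
bool gauss_jordan_inverse(std::vector<double>& a, std::vector<double>& inv, int n) {
    inv.assign(n * n, 0.0);
    for (int i = 0; i < n; ++i) inv[i * n + i] = 1.0;      // identity

    for (int col = 0; col < n; ++col) {
        // Partial pivoting: pick the row with the largest entry in this column.
        int pivot = col;
        for (int r = col + 1; r < n; ++r)
            if (std::fabs(a[r * n + col]) > std::fabs(a[pivot * n + col])) pivot = r;
        if (std::fabs(a[pivot * n + col]) < 1e-12) return false;   // assumed tolerance
        for (int c = 0; c < n; ++c) {
            std::swap(a[col * n + c],   a[pivot * n + c]);
            std::swap(inv[col * n + c], inv[pivot * n + c]);
        }
        // Normalize the pivot row, then eliminate the column in all other rows.
        double p = a[col * n + col];
        for (int c = 0; c < n; ++c) { a[col * n + c] /= p; inv[col * n + c] /= p; }
        for (int r = 0; r < n; ++r) {
            if (r == col) continue;
            double factor = a[r * n + col];   // the row updates are the parallel part
            for (int c = 0; c < n; ++c) {
                a[r * n + c]   -= factor * a[col * n + c];
                inv[r * n + c] -= factor * inv[col * n + c];
            }
        }
    }
    return true;
}

int main() {
    std::vector<double> a = {4, 7, 2, 6}, inv;   // 2x2 example
    if (gauss_jordan_inverse(a, inv, 2))
        std::printf("inverse:\n%6.3f %6.3f\n%6.3f %6.3f\n",
                    inv[0], inv[1], inv[2], inv[3]);
}
```
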
Apr 17

PM4Py-GPU: a High-Performance General-Purpose Library for Process Mining

Open-source process mining tools provide many algorithms for the analysis of event data, which can be used to analyze mainstream processes (e.g., O2C, P2P, CRM). However, compared to commercial tools, they lack performance and struggle to analyze large amounts of data. This paper presents PM4Py-GPU, a Python process mining library based on the NVIDIA RAPIDS […]
