high performance computing on graphics processing units: hgpu.org

Posts

Feb, 11

Scalability of Self-organizing Maps on a GPU cluster using OpenCL and CUDA

We evaluate a novel implementation of a Self-Organizing Map (SOM) on a Graphics Processing Unit (GPU) cluster. Using various combinations of OpenCL, CUDA, and two different graphics cards, we demonstrate the scalability of the SOM implementation on one to eight GPUs. Results indicate that while the algorithm scales well with the number of training samples […]

CUDA • OpenCL

Feb, 10

Automatic Performance Optimization in ViennaCL for GPUs

Highly parallel computing architectures such as graphics processing units (GPUs) pose several new challenges for scientific computing, which have been absent on single core CPUs. However, a transition from existing serial code to parallel code for GPUs often requires a considerable amount of effort. The Vienna Computing Library (ViennaCL) presented in the beginning of this […]

OpenCL

Feb, 10

Customizing Instruction Set Extensible Reconfigurable Processors using GPUs

Many reconfigurable processors allow their instruction sets to be tailored according to the performance requirements of target applications. They have gained immense popularity in recent years because of this flexibility of adding custom instructions. However, most design automation algorithms for instruction set customization (like enumerating and selecting the optimal set of custom instructions) are computationally […]

CUDA • OpenCL

Feb, 10

Ensemble K-means on multi-core architectures

Ensemble problems use multiple models generated from a data set to improve the correctness and ensure faster convergence. The use of multiple models makes ensemble problems computationally intensive. In this paper, we explore the parallelization of ensemble problems on modern multicore hardware like CPUs and GPUs. We use the K-means clustering algorithm as a case […]

OpenCL

Feb, 10

Implementing Molecular Dynamics on Hybrid High Performance Computers – Particle-Particle Particle-Mesh

The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with nodes containing more than one type of floating-point processor (e.g. CPU and GPU), are now becoming more […]

CUDA • OpenCL

Feb, 10

Real-Time SAH BVH Construction for Ray Tracing Dynamic Scenes

This work aims to develop effective algorithms for building full SAH BVH trees on the GPU in real time. It is assumed that all scene objects are represented by a number of triangles (the so-called "triangle soup"), while arbitrary changes in the geometry are allowed […]

OpenCL

Feb, 9

Accelerating H.264 Advanced Video Coding with GPU/CUDA Technology

With the rise of streaming media on the Internet and the YouTube revolution, the demand for online videos is costing companies a significant amount of bandwidth. To alleviate the resources needed for streaming media, video compression removes redundant information and minimizes the loss in quality experienced by a human audience. In response to the need […]

CUDA

Feb, 9

Parallel Semi-Implicit Time Integrators

In this paper, we further develop a family of parallel time integrators known as Revisionist Integral Deferred Correction methods (RIDC) to allow for the semi-implicit solution of time dependent PDEs. Additionally, we show that our semi-implicit RIDC algorithm can harness the computational potential of multiple general purpose graphical processing units (GPGPUs) by utilizing existing CUBLAS […]

CUDA

Feb, 9

The Boat Hull Model: Adapting the Roofline Model to Enable Performance Prediction for Parallel Computing

Multi-core and many-core have been major trends for the past six years and are expected to continue for the coming decades. With these trends in parallel computing, it becomes increasingly difficult to decide on which architecture to run a given application. In this work, we use an algorithm classification to predict performance prior to algorithm […]

CUDA

Feb, 9

CudaRF: A CUDA-based Implementation of Random Forests

Machine learning algorithms are frequently applied in data mining applications. Many of the tasks in this domain concern high-dimensional data. Consequently, these tasks are often complex and computationally expensive. This paper presents a GPU-based parallel implementation of the Random Forests algorithm. In contrast to previous work, the proposed algorithm is based on the compute unified […]

CUDA

Feb, 9

Real-time simulation of a spiking neural network model of the basal ganglia circuitry using general purpose computing on graphics processing units

Real-time simulation of a biologically realistic spiking neural network is necessary for evaluation of its capacity to interact with real environments. However, the real-time simulation of such a neural network is difficult due to its high computational costs that arise from two factors: (1) vast network size and (2) the complicated dynamics of biologically realistic […]

CUDA

Feb, 8

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters

This paper develops and evaluates search and optimization techniques for auto-tuning 3D stencil (nearest-neighbor) computations on GPUs. Observations indicate that parameter tuning is necessary for heterogeneous GPUs to achieve optimal performance with respect to a search space. Our proposed framework takes a most concise specification of stencil behavior from the user as a single formula, […]

CUDA
