high performance computing on graphics processing units: hgpu.org

Posts

Dec, 12

Towards Domain-specific Computing for Stencil Codes in HPC

High Performance Computing (HPC) systems are becoming increasingly heterogeneous. Different processor types can be found on a single node, including accelerators such as Graphics Processing Units (GPUs). To cope with the challenge of programming such complex systems, this work presents a domain-specific approach to automatically generate code tailored to different processor types. Low-level […]

CUDA • OpenCL

Dec, 10

Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs

General-purpose graphics processing units (GPGPUs) are at their best when accelerating computation by exploiting the abundant thread-level parallelism (TLP) offered by many classes of HPC applications. To facilitate such high TLP, emerging programming models like CUDA and OpenCL allow programmers to create work abstractions in terms of smaller work units, called cooperative thread arrays (CTAs), consisting […]

CUDA

Dec, 8

OpenMP Programming on Intel(R) Xeon Phi(TM) Coprocessors: An Early Performance Comparison

The demand for more and more compute power is growing rapidly in many fields of research. Accelerators, like GPUs, are one way to fulfill these requirements, but they often require a laborious rewrite of the application using special programming paradigms like CUDA or OpenCL. The Intel(R) Xeon Phi(TM) coprocessor is based on the Intel(R) Many […]

Dec, 4

Fast Parallel Sorting Algorithms on GPUs

This paper presents a comparative analysis of three widely used parallel sorting algorithms (OddEven sort, Rank sort and Bitonic sort) in terms of sorting rate, sorting time and speed-up on CPU and different GPU architectures. Alongside, we have implemented a novel parallel algorithm, the min-max butterfly network, for finding the minimum and maximum in large data sets. […]

OpenCL

Dec, 4

A MPI back-end for the OpenACC accULL. Exploiting OpenACC semantics in Message Passing Clusters

The arrival of hardware accelerators on the HPC scene has made unprecedented performance available to developers. However, even expert developers may not be ready to exploit the complex hierarchies of these new heterogeneous systems. We need to find a way to leverage the programming effort in these architectures at the programming-language level; otherwise, developers will […]

OpenCL

Dec, 1

CPUless PCs inside networked control systems

This paper presents results that advance our previous WSEAS paper [1] and aims to lay the groundwork for a design framework that helps design hard real-time control systems using Unix/Unix-like operating systems. This framework was designed in the course of a research project supported by the Slovak Research and Development Agency under contract No. VMSP-II-0034-09. This framework contains layer […]

OpenCL

Nov, 27

A Data-Parallel Algorithmic Modelica Extension for Efficient Execution on Multi-Core Platforms

New multi-core CPU and GPU architectures promise high computational power at a low cost if suitable computational algorithms can be developed. However, parallel programming for such architectures is usually non-portable, low-level and error-prone. To make the computational power of new multi-core architectures more easily available to Modelica modelers, we have developed the ParModelica algorithmic language […]

OpenCL

Nov, 26

A compiler toolkit for array-based languages targeting CPU/GPU hybrid systems

This paper presents a compiler toolkit that addresses two important emerging challenges: (1) effectively compiling dynamic array-based languages such as MATLAB, Python and R; and (2) effectively utilizing a wide range of rapidly evolving hybrid CPU/GPU architectures. The toolkit provides: a high-level IR specifically designed to express a wide range of array-based computations and indexing […]

OpenCL

Nov, 24

GPU Isosurface Raycasting of FCC Datasets

This paper presents an efficient and accurate isosurface rendering algorithm for the natural C^1 splines on the face-centered cubic (FCC) lattice. Leveraging fast and accurate evaluation of a spline field and its gradient, accompanied by efficient empty-space skipping, the approach generates high-quality isosurfaces of FCC datasets at interactive speed (20-70 fps). The pre-processing computation (quasi-interpolation […]

OpenCL • OpenGL

Nov, 18

Auto-tunable GPU BLAS (thesis)

In this paper, we present our implementation of an auto-tuning system, written in C++, which incorporates the use of OpenCL kernels. We deploy this approach on different GPU architectures, evaluating the performance of the approach. Our main focus is to easily generate tuned code that would otherwise require a large amount of empirical testing, […]

OpenCL

Nov, 14

Real-Time Scheduling Using GPUs – Advanced and More Accurate Proof of Feasibility

This paper reports our evaluation of OpenCL as a platform for hard real-time scheduling. In particular, we have evaluated which types of tasks are faster on a GPGPU than on a CPU. We have investigated computational tasks, memory-intensive tasks (especially tasks using low-latency GDDR memory) and disk-intensive tasks. This study is the part […]

OpenCL

Nov, 10

Efficient Dynamic Derived Field Generation on Many-Core Architectures Using Python

Derived field generation is a critical aspect of many visualization and analysis systems. This capability is frequently implemented by providing users with a language to create new fields and then translating their "programs" into a pipeline of filters that are combined in sequential fashion. Although this design is highly extensible and practical for development, the […]

OpenCL

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)