high performance computing on graphics processing units: hgpu.org

Posts

Oct, 11

High Performance Parallel Design Based on Session Programming

Session programming is a programming model based on the theory of session types, a typing system for the pi-calculus. Session types were developed to model structured interaction between processes, and correctly typed processes have the property of communication safety. Session Java (SJ) is a full implementation of session types in Java. In this project, we […]

CUDA

Oct, 11

Static Compilation Analysis for Host-Accelerator Communication Optimization

We present an automatic, static program transformation that schedules and generates efficient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as […]

CUDA

Oct, 11

Using the CPU to Improve Performance in 3D Applications

Many applications in the film and game industries require multiple calculations to be performed on vast data sets. Any of these tools that are required to run in real-time, and be used interactively, must be developed with performance in mind. The following paper aims to explain how the Central Processing Unit can be utilised effectively […]

Oct, 11

A tutorial overview on the properties of the discrete cosine transform for encoded image and video processing

Discrete trigonometric transforms, such as the discrete cosine transform (DCT) and the discrete sine transform (DST), have been extensively used in signal processing for transform-based coding. The even type-II DCT, used in image and video coding, became especially popular for decorrelating the pixel data and minimizing the spatial redundancy. Albeit this DCT tends to be […]

Oct, 10

Computational intensive Tasks in Multimedia Signal Processing

Driven by the gaming industry and the great emphasis placed on the visual sense, graphics processing units (GPUs) have improved their performance in recent years, even outperforming the computational capacity of single-core CPUs. In fact, multi-core architectures are nowadays common for both CPUs and GPUs in order to exploit parallelism in computing. In this […]

CUDA

Oct, 10

A GPU-Accelerated Parallel Preconditioner for the Solution of the Boltzmann Transport Equation for Semiconductors

The solution of large systems of linear equations is typically achieved by iterative methods. The rate of convergence of these methods can be substantially improved by the use of preconditioners, which can be either applied in a black-box fashion to the linear system, or exploit properties specific to the underlying problem for maximum efficiency. However, […]

OpenCL

Oct, 10

Anti-parallel Patterns in Fine-grain Data-parallel Programs

Parallel systems and parallel programming are becoming increasingly important. The developer in want of raw speed can no longer expect sequential processors to become faster and needs to turn to parallel platforms and parallel programs to satisfy his need for speed. But writing a parallel program is difficult and writing one with a decent […]

CUDA

Oct, 10

Benchmarks Based on Anti-Parallel Patterns for the Evaluation of GPUs

We put forward "anti-parallel patterns" to guide the parallel performance analysis process. Anti-parallel patterns or APPs are common parts of parallel programs that cause these programs to have less than ideal performance, where the ideal speedup equals the number of processors. We present benchmarks to model the behavior of APPs on parallel platforms. Each benchmark […]

OpenCL

Oct, 10

Streaming-Oriented Parallelization of Domain-Independent Irregular Kernels

Current parallelizing and optimizing compilers use techniques for the recognition of computational kernels to improve the quality of the target code. Domain-independent kernels characterize the computations carried out in an application, independently of the implementation details of a given programming language. This paper presents streaming-oriented parallelizing transformations for irregular assignment and irregular reduction kernels. The […]

Oct, 10

Evaluation of GPU Architectures Using Spiking Neural Networks

During recent years General-Purpose Graphical Processing Units (GP-GPUs) have entered the field of High-Performance Computing (HPC) as one of the primary architectural focuses for many research groups working with complex scientific applications. Nvidia’s Tesla C2050, codenamed Fermi, and AMD’s Radeon 5870 are two devices positioned to meet the computationally demanding needs of supercomputing research groups […]

OpenCL

Oct, 10

Towards an Effective Unified Programming Model for Many-Cores

Building an effective programming model for many-core processors is challenging. On the one hand, the increasing variety of platforms and their specific programming models force users to take a hardware-centric approach not only for implementing parallel applications, but also for designing them. This approach diminishes portability and, eventually, limits performance. On the other hand, to […]

OpenCL

Oct, 10

Denoising Volumetric Data on GPU

Volumetric data is increasingly used in everyday aspects of our lives. Processing such data is computationally expensive, and until now more sophisticated algorithms could not be used. The possibilities for processing such data have widened considerably with the increase in parallel computational power of modern GPUs. We present a novel […]

OpenCL



