high performance computing on graphics processing units: hgpu.org

Posts

Aug, 3

Strategies for Optimization of Parallel Programs

Multi-core processors are present in most forms of computing, from a pocket-size smartphone to supercomputers. Consequently, parallel and concurrent programming has reemerged as a pressing concern for everyone interested in exploiting the full computational power of these machines. Writing parallel, and especially concurrent, programs is not a trivial task as it requires a different […]

CUDA • OpenCL

Aug, 2

Real-Time Electroholography Using a Multi-GPU Environmental PC

We report real-time electroholography using a compact system composed of a multi-GPU environmental PC with four Kepler-architecture GPUs. Our system can calculate a 1,920×1,024-pixel CGH from a 3D object composed of 10,240 points in 40.3 ms.

CUDA • OpenGL

Aug, 2

DRiVE: An Example of Distributed Rendering in Virtual Environments

Most Virtual Reality (VR) applications use rendering methods which implement local illumination models, simulating only direct interaction of light with 3D objects. They do not take into account the energy exchange between the objects themselves, making the resulting images look non-optimal. The main reason for this is the simulation of global illumination having a high […]

CUDA

Aug, 2

Large-Scale Sound Field Rendering in Rectangular Room with Specular Reflection

Sound field rendering is a technique for computing the sound field from three-dimensional numerical models constructed in the computer; it is the same concept as graphics rendering in computer graphics. In this paper, a GPU (Graphics Processing Unit) cluster system is applied to sound field rendering for a large […]

CUDA

Aug, 2

Comparing CUDA, OpenCL and OpenGL Implementations of the Cardiac Monodomain Equations

Computer simulations of cardiac electrophysiology are a helpful tool in the study of bioelectric activity of the heart. The cardiac monodomain model comprises a nonlinear system of partial differential equations and its numerical solution represents a very intensive computational task due to the required fine spatial and temporal resolution. Recent studies have shown that the […]

CUDA • OpenCL • OpenGL

Aug, 1

NOVA: A Functional Language for Data Parallelism

Functional languages provide a solid foundation on which complex optimization passes can be designed to exploit available parallelism in the underlying system. Their mathematical foundations enable high-level optimizations that would be impossible in traditional imperative languages. This makes them uniquely suited for generation of efficient target code for parallel systems, such as multiple Central Processing […]

CUDA

Aug, 1

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

In this big data era, the capability of mining and analyzing large scale datasets is imperative. As data are becoming more abundant than ever before, data driven methods are playing a critical role in areas such as decision support and business intelligence. In this paper, we demonstrate how state-of-the-art GPUs and the Dynamic Parallelism feature […]

CUDA

Aug, 1

A note on the GPU acceleration of eigenvalue computations

Eigenvalue computations for large sparse matrices, such as the Lanczos method, are commonly based on Krylov subspace techniques. One of the dominant operations in such algorithms is the iterated computation of inner products with the same vector in order to preserve orthogonality of the Krylov basis. These operations can be accelerated by existing BLAS functionality using […]

CUDA • OpenCL

Aug, 1

Matrix Convolution using Parallel Programming

The convolution theorem is used to multiply matrices of two different sizes, i.e., matrices in which the number of rows in the first matrix is not equal to the number of columns in the second matrix. In this study, the multiplication of 3×3 and 4×4 matrices was done using MPI. A 3×3 matrix was taken […]

OpenCL

Aug, 1

GPU peer-to-peer techniques applied to a cluster interconnect

Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, basically by avoiding staging to host memory, they require specific hardware features which are not available on current generation network adapters. In this paper we describe the architectural modifications […]

CUDA

Jul, 31

Iterative CT Reconstruction on the GPU

The computing power of modern GPUs makes them very suitable for Computed Tomography (CT) image reconstruction. Apart from accelerating the reconstruction, their extra computing performance compared to conventional CPUs can be used to increase image quality in several ways. In this paper we present our upgraded GPU based iterative reconstruction algorithm, including ML-TR (Maximum Likelihood […]

CUDA

Jul, 31

The Promises of Hybrid Hexagonal/Classical Tiling for GPU

Time-tiling is necessary for efficient execution of iterative stencil computations. But the usual hyper-rectangular tiles cannot be used because of positive/negative dependence distances along the stencil’s spatial dimensions. Several prior efforts have addressed this issue. However, known techniques trade enhanced data reuse for other causes of inefficiency, such as unbalanced parallelism, redundant computations, or increased […]

CUDA

* * *

Recent source codes

Allo: Accelerator Design Language

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

HPC Benchmark Survey

HDM: Home made Diffusion Models

General Matrix Multiplication (GEMM)

CrossTL: Universal Programming Language & Translator

TBD-GPU

DG-SWEM - The Discontinuous Galerkin Shallow Water Equation Model

torchPDLP: Primal-Dual Linear Programming in PyTorch. In collaboration with AMD and IPAM

Benchmarks for Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs

Most viewed papers (last 30 days)