high performance computing on graphics processing units: hgpu.org

Posts

Mar, 18

Improving Cache Locality for Ray Casting with CUDA

In this paper, we present an acceleration method for texture-based ray casting on the compute unified device architecture (CUDA) compatible graphics processing unit (GPU). Since ray casting is a memory-intensive application, our method increases the hit rate of the texture cache during rendering. To achieve this, our method dynamically selects the width and height of […]

CUDA

Mar, 18

Towards user transparent parallel multimedia computing on GPU-clusters

The research area of Multimedia Content Analysis (MMCA) considers all aspects of the automated extraction of knowledge from multimedia archives and data streams. To satisfy the increasing computational demands of MMCA problems, the use of High Performance Computing (HPC) techniques is essential. As most MMCA researchers are not HPC experts, there is an urgent need […]

CUDA

Mar, 18

Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments

Lack of efficient and transparent interaction with GPU data in hybrid MPI+GPU environments challenges GPU acceleration of large-scale scientific and engineering computations. A particular challenge is the efficient transfer of noncontiguous data to and from GPU memory. MPI supports such transfers through the use of datatypes; however, an efficient means of utilizing datatypes for […]

CUDA

Mar, 18

Usable assembly language for GPUs: a success story

The NVIDIA compilers nvcc and ptxas leave the programmer with only very limited control over register allocation, register spills, instruction selection, and instruction scheduling. In theory a programmer can gain control by writing an entire kernel in van der Laan’s cudasm assembly language, but this requires tedious, error-prone tracking of register assignments. This paper introduces […]

CUDA

Mar, 18

VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units

Graphics processing units (GPUs) have been widely used for general purpose computation acceleration. However, current programming models such as CUDA and OpenCL can support GPUs only on the local computing node, where the application execution is tightly coupled to the physical GPU hardware. In this work, we propose a virtual OpenCL (VOCL) framework to support […]

OpenCL

Mar, 16

Globally scheduled real-time multiprocessor systems with GPUs

Graphics processing units, GPUs, are powerful processors that can offer significant performance advantages over traditional CPUs. The last decade has seen rapid advancement in GPU computational power and generality. Recent technologies make it possible to use GPUs as co-processors to CPUs. The performance advantages of GPUs can be great, often outperforming traditional CPUs by orders […]

CUDA

Mar, 16

CUDA 2D Stencil Computations for the Jacobi Method

We are witnessing the consolidation of the GPU streaming paradigm in parallel computing. This paper explores stencil operations in CUDA to optimize the Jacobi method on GPUs for solving Laplace’s differential equation. The code keeps the access pattern constant across a large number of loop iterations, making it representative of a wide set of […]

CUDA

Mar, 16

Parallel Sparse Linear Algebra for Multi-core and Many-core Platforms: Parallel Solvers and Preconditioners

Partial differential equations are typically solved by means of finite difference, finite volume, or finite element methods, resulting in large, highly coupled, ill-conditioned, and sparse (non-)linear systems. To minimize computing time, we want to exploit the capabilities of modern parallel architectures. The rapid hardware shifts from single core to multi-core and many-core […]

CUDA

Mar, 16

Developing a CUDA solver for large sparse matrices for MARIN

This master's thesis has been written for the degree of Master of Science in Applied Mathematics at the faculty of Electrical Engineering, Mathematics and Computer Sciences of Delft University of Technology. The report concludes a nine-month internship carried out at Maritime Research Institute Netherlands (MARIN). MARIN supplies innovative products for the offshore industry and […]

CUDA

Mar, 16

Multi-platform Linear Algebra

HiFlow3 is a multi-purpose finite element software providing powerful tools for efficient and accurate solution of a wide range of problems modeled by partial differential equations (PDEs). Based on object-oriented concepts and the full capabilities of C++, the HiFlow3 project follows a modular and generic approach for building efficient parallel numerical solvers. It provides highly […]

CUDA

Mar, 15

On the Use of Small 2D Convolutions on GPUs

Computing many small 2D convolutions using FFTs is a basis for a large number of applications in many domains in science and engineering, among them electromagnetic diffraction modeling in physics. The GPU architecture seems to be a suitable architecture to accelerate these convolutions, but reaching high application performance requires substantial development time and non-portable optimizations. […]

CUDA

Mar, 15

Iterative Statistical Kernels on Contemporary GPUs

We present a study of three important kernels that occur frequently in iterative statistical applications: Multi-Dimensional Scaling (MDS), PageRank, and K-Means. We implemented each kernel using OpenCL and evaluated their performance on NVIDIA Tesla and NVIDIA Fermi GPGPU cards using dedicated hardware, and in the case of Fermi, also on the Amazon EC2 cloud-computing environment. […]

OpenCL

Recent source codes

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs
