high performance computing on graphics processing units: hgpu.org

Posts

Aug, 1

A Scalable Approach to Solving Dense Linear Algebra Problems on Hybrid CPU-GPU Systems

Aiming to fully exploit the computing power of all CPUs and all GPUs on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, as well as to accommodate the heterogeneity between CPUs and GPUs. The new […]

CUDA

Aug, 1

Optimizing performance per watt on GPUs in High Performance Computing: temperature, frequency and voltage effects

The magnitude of the real-time digital signal processing challenge attached to large radio astronomical antenna arrays motivates use of high performance computing (HPC) systems. The need for high power efficiency (performance per watt) at remote observatory sites parallels that in HPC broadly, where efficiency is an emerging critical metric. We investigate how the performance per […]

CUDA

Aug, 1

Discriminative Convolutional Sum-Product Networks on GPU

Sum-Product Networks (SPNs) are a deep architecture recently proposed for image classification and modeling. In contrast to the loopy graphical models commonly used in computer vision, exact inference and learning in SPNs are tractable. As long as consistency and completeness are ensured, an SPN allows efficient calculation of the partition function and all marginals of graphical […]

CUDA

Jul, 30

Automatic Parallelization of Tiled Stencil Loop Nests on GPUs

This thesis attempts to design and implement a compiler framework based on the polyhedral model. The compiler automatically parallelizes loop nests, especially stencil kernels, into efficient GPU code using the loop tiling transformations that the polyhedral model describes. To enhance parallel performance, we introduce three practically efficient techniques to process different types of loop nests. The […]

CUDA

Jul, 30

Dynamic Data Management Among Multiple Databases for Optimization of Parallel Computations in Heterogeneous HPC Systems

The rapid development of diverse computer architectures and hardware accelerators means that the design of parallel systems faces new problems resulting from their heterogeneity. Our implementation of a parallel system called KernelHive allows applications to run efficiently in a heterogeneous environment consisting of multiple collections of nodes with different types of computing devices. The execution engine of the […]

OpenCL

Jul, 30

Scaling Multifluid Compressible Fluid Dynamics to 700,000 cores, 1.5 Pflop/s, and a Trillion Grid Cells

We are using the Blue Waters system at NCSA to study compressible, turbulent mixing of gases in the deep interiors of stars and also in the context of inertial confinement fusion (ICF). In December, 2012, during the Blue Waters friendly user access period, we carried out a simulation of an ICF test problem on a […]

Jul, 30

Research on Parallel DVH Statistic Based on CUDA

A Dose Volume Histogram (DVH) is necessary for evaluating radiotherapy planning. With the increase in patient CT slices and the development of intensity-modulated radiation therapy (IMRT) technology, the statistical processing of a DVH requires a large number of cubic interpolation calculations, and sequential single-threaded DVH code on the CPU cannot meet the real-time requirement. The paper presents […]

CUDA

Jul, 30

A CUDA-Based Real Parameter Optimization Benchmark

Benchmarking is key for developing and comparing optimization algorithms. In this paper, a CUDA-based real parameter optimization benchmark (cuROB) is introduced. Test functions with diverse properties are included in cuROB and implemented efficiently with CUDA. A speedup of one order of magnitude can be achieved in comparison with the CPU-based benchmark of CEC’14.

CUDA

Jul, 29

Optimizing Lempel-Ziv Factorization for the GPU Architecture

Lossless data compression is used to reduce storage requirements, allowing for the relief of I/O channels and better utilization of bandwidth. The Lempel-Ziv lossless compression algorithms form the basis for many of the most commonly used compression schemes. General-purpose computing on graphics processing units (GPGPU) allows us to take advantage of the massively parallel […]

CUDA

Jul, 29

Implicit Methods for Real-Time simulation of Interactive Waves

The project focuses on developing a simulator in which ships and waves interact. The new wave model is the Variational Boussinesq model (VBM). However, this new, realistic model requires much more computational effort. The VBM mainly requires an unsteady-state solver that solves a coupled system of equations at each frame (20 fps). […]

CUDA

Jul, 29

Parallel Worldline Numerics: Implementation and Error Analysis

We give an overview of the worldline numerics technique, and discuss the parallel CUDA implementation of a worldline numerics algorithm. In the worldline numerics technique, we wish to generate an ensemble of representative closed-loop particle trajectories, and use these to compute an approximate average value for Wilson loops. We show how this can be done […]

CUDA

Jul, 29

Mixed-precision orthogonalization scheme and its case studies with CA-GMRES on a GPU

We propose a mixed-precision orthogonalization scheme that takes the input matrix in a standard 32- or 64-bit floating-point precision, but uses higher-precision arithmetic to accumulate its intermediate results. For the 64-bit precision, our scheme uses software emulation for the higher-precision arithmetic, and requires about 20x more computation but about the same amount of communication as […]

CUDA
