high performance computing on graphics processing units: hgpu.org

Posts

Aug, 16

High accuracy solutions to energy gradient flows from material science models

A computational framework is presented for materials science models that come from energy gradient flows. The models of interest lead to the evolution of structure involving two or more phases. The framework includes higher order derivative models and vector problems. Solutions are considered in periodic cells and standard Fourier spectral discretization in space is used. […]

CUDA

Aug, 16

Parametric Flows: Automated Behavior Equivalencing for Symbolic Analysis of Races in CUDA Programs

The growing scale of concurrency requires automated abstraction techniques to cut down the effort in concurrent system analysis. In this paper, we show that the high degree of behavioral symmetry present in GPU programs allows CUDA race detection to be dramatically simplified through abstraction. Our abstraction technique is one of automatically creating parametric flows […]

CUDA

Aug, 16

Efficient GPU implementation of parameter estimation of a statistical model for online advertisement optimization

The optimization problem of estimating parameters using a maximum a posteriori (MAP) [3] approach on a non-linear statistical model with a large data set can be solved using an L-BFGS [10] algorithm. When dealing with an ever-changing reality, the evaluation needs to be fast to capture the immediacy of the observations. This thesis will present […]

CUDA

Aug, 15

vCUDA Framework Development for GPU Virtualization

vCUDA is a middleware that allows an application to use a CUDA-compatible graphics processing unit (GPU) installed in a remote computer as if it were installed in the computer where the application is being executed. vCUDA is designed following the client-server distributed architecture. On one side, the client employs a library of wrappers to the […]

CUDA

Aug, 15

Real root isolation for univariate polynomials on GPUs and multicores

I participate in the development of the cumodp library. My objective is to develop code for the exact calculation of the real roots of univariate polynomials. Stating this problem is very easy. However, as one dives into the details, one realizes that there are many challenges in reaching highly efficient algorithmic and […]

CUDA

Aug, 15

GPU-Assisted Cryptography of Log-Structured Indices

General-purpose programming of Graphics Processing Units (GPUs) is a relatively new technological advancement. GPUs contain vast amounts of computational power with their many-core architectures. Within many computer systems, the power of these GPUs often goes unused outside the realm of graphics. Many of today’s common computational tasks are well suited for the single […]

CUDA

Aug, 15

A New Cooperative Evolutionary Multi-Swarm Optimizer Algorithm Based on CUDA Parallel Architecture Applied to Solve Engineering Optimization Problems

This paper presents a new Cooperative Evolutionary Multi-Swarm Optimization Algorithm (CEMSO-GPU) based on the CUDA parallel architecture and applied to engineering problems. The approach focuses on the use of a master/slave swarm concept with a data-sharing mechanism, and on a parallelism method based on the paradigm of General Purpose Computing on Graphics […]

CUDA

Aug, 15

Optimizing the Computation of Eigenvalues Using Graphics Processing Units

In this paper, we first briefly describe some mathematical aspects regarding the computation of eigenvalues, followed by an original approach: a bisection algorithm useful in computing eigenvalues for a tridiagonal symmetric matrix of arbitrary size, using the computing capabilities of the latest graphics processing units that incorporate the Compute Unified Device Architecture. The novel approach […]

CUDA

Aug, 14

Performance of FORTRAN and C GPU Extensions for a Benchmark Suite of Fourier Pseudospectral Algorithms

A comparison of PGI OpenACC, FORTRAN CUDA, and Nvidia CUDA pseudospectral methods on a single GPU and GCC FORTRAN on single and multiple CPU cores is reported. The GPU implementations use cuFFT and the CPU implementations use FFTW. Porting pre-existing FORTRAN codes to utilize GPUs is efficient and easy to implement with OpenACC and […]

CUDA

Aug, 14

Accelerating cellular automata simulations using AVX and CUDA

We investigated various methods of parallelization of the Frisch-Hasslacher-Pomeau (FHP) cellular automata algorithm for modeling fluid flow. These methods include SSE, AVX, and POSIX Threads for central processing units (CPUs) and CUDA for graphics processing units (GPUs). We present implementation details of the FHP algorithm based on AVX/SSE and CUDA technologies. We found that (a) […]

CUDA

Aug, 14

Dynamic Warp Resizing in High-Performance SIMT

Modern GPUs synchronize threads grouped in a warp at every instruction. This improves SIMD efficiency and makes sharing fetch and decode resources possible. The number of threads included in each warp (or warp size) affects divergence, synchronization overhead, and the efficiency of memory access coalescing. Small warps reduce the performance penalty associated with […]

Aug, 14

A Second-Order Distributed Trotter-Suzuki Solver with a Hybrid Kernel

The Trotter-Suzuki approximation leads to an efficient algorithm for solving the time-dependent Schroedinger equation. Using existing highly optimized CPU and GPU kernels, we developed a distributed version of the algorithm that runs efficiently on a cluster. Our implementation also improves single node performance, and is able to use multiple GPUs within a node. The scaling […]

CUDA

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

GigaAPI for GPU Parallelization

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

Advances in Semantic Patching for HPC-oriented Refactorings with Coccinelle

DuoReduce: MLIR's benchmark

Hardware-Assisted Software Testing and Debugging for Heterogeneous Computing

Recent source codes

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

