Posts
Sep, 15
Algorithmic GPGPU Memory Optimization
The performance of General-Purpose computation on Graphics Processing Units (GPGPU) is heavily dependent on memory access behavior. This sensitivity is due to a combination of the Massively Parallel Processing (MPP) execution model underlying GPUs and the lack of architectural support for handling irregular memory access patterns. Application performance can be significantly improved […]
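As a rough illustration of the sensitivity described above (not taken from the paper), the CUDA sketch below contrasts a coalesced copy, where consecutive threads in a warp read consecutive addresses, with a strided copy that scatters each warp's requests across many memory transactions; kernel names and sizes are placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: thread i reads element i, so each warp touches one contiguous segment.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i*stride, spreading a warp's requests over
// many separate memory transactions.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride % n];
}

int main() {
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    dim3 block(256), grid((n + 255) / 256);
    copy_coalesced<<<grid, block>>>(in, out, n);
    copy_strided<<<grid, block>>>(in, out, n, 32);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Timing the two kernels (e.g. with nvprof/Nsight) typically shows the strided version running several times slower despite moving the same amount of data, which is the kind of gap access-pattern optimization targets.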
Sep, 15
Expressed Sequence Tag Clustering using Commercial Gaming Hardware
In this dissertation we aim to utilize GPU technology to optimize and improve EST clustering. Extensive research into this cross-disciplinary approach was required before it could even be considered. We found that, although this line of research has not received much attention, there are significant gains […]
Sep, 15
Porting to the Intel Xeon Phi: Opportunities and Challenges
This work describes the challenges presented by porting code to the Intel Xeon Phi coprocessor, as well as opportunities for optimization and tuning. We use micro-benchmarks, code segments, assembly listings and application level results to illustrate the key issues in porting to the Xeon Phi coprocessor, always keeping in mind both portability and performance. While […]
Sep, 15
Quine-McCluskey algorithm on GPGPU
This paper deals with the parallelization of the Quine-McCluskey algorithm. This Boolean function minimization algorithm becomes impractical with more than four variables: the problem it solves is NP-hard, and its runtime grows exponentially with the number of variables. The goal is to show that a parallel implementation of the Quine-McCluskey […]
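For context (this is not the paper's implementation), the core combining step of Quine-McCluskey merges two implicants whose don't-care masks match and whose fixed bits differ in exactly one position. A minimal CUDA sketch of that step, checking every pair rather than only adjacent popcount groups, could look as follows; the `Implicant` struct and kernel name are my own.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One implicant: 'value' holds the fixed bits, 'mask' marks don't-care positions.
struct Implicant { unsigned value, mask; };

// Each thread owns implicant i and tests it against every later implicant j.
// Two implicants merge when their don't-care masks match and their fixed bits
// differ in exactly one position (the classic combining step).
__global__ void combine_step(const Implicant* in, int n,
                             Implicant* out, int* out_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int j = i + 1; j < n; ++j) {
        if (in[i].mask != in[j].mask) continue;
        unsigned diff = in[i].value ^ in[j].value;
        if (__popc(diff) == 1) {                 // differ in exactly one bit
            int k = atomicAdd(out_count, 1);     // append the merged implicant
            out[k] = { in[i].value & ~diff, in[i].mask | diff };
        }
    }
}

int main() {
    // Minterms 0100, 1100, 0101, 1101 of a 4-variable function.
    Implicant h_in[] = { {4, 0}, {12, 0}, {5, 0}, {13, 0} };
    int n = 4;
    Implicant *d_in, *d_out; int *d_count, h_count = 0;
    cudaMalloc(&d_in, n * sizeof(Implicant));
    cudaMalloc(&d_out, n * n * sizeof(Implicant));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(Implicant), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));
    combine_step<<<1, 64>>>(d_in, n, d_out, d_count);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("merged implicants: %d\n", h_count);   // expect 4 merges
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_count);
    return 0;
}
```

The pairwise structure is what makes the step attractive for GPUs: the number of comparisons grows quadratically with the number of implicants, but each comparison is independent.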
Sep, 14
A Novel CPU/GPU Simulation Environment for Large-Scale Biologically-Realistic Neural Modeling
Computational Neuroscience is an emerging field that provides unique opportunities to study complex brain structures through realistic neural simulations. However, as biological details are added to models, the execution time for the simulation becomes longer. Graphics Processing Units (GPUs) are now being utilized to accelerate simulations due to their ability to perform computations in parallel. […]
Sep, 14
GPU-based Parallel Reservoir Simulators
We have developed a GPU-based parallel linear solver package. When solving matrices arising from reservoir simulation, the parallel solvers are much more efficient than CPU-based linear solvers. However, further effort is needed to improve the setup phase of the domain decomposition, the ILUT factorization, and the parallelism of the block ILUT preconditioner.
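The workhorse kernel inside such GPU solver packages is the sparse matrix-vector product; the sketch below shows a basic one-thread-per-row CSR SpMV in CUDA. It is a generic illustration, not the authors' solver, and production codes typically use warp-per-row or blocked variants for better load balance.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// y = A*x with A in CSR format: one thread per row.
__global__ void spmv_csr(int n_rows, const int* row_ptr, const int* col_idx,
                         const double* val, const double* x, double* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    double sum = 0.0;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += val[k] * x[col_idx[k]];
    y[row] = sum;
}

int main() {
    // 3x3 tridiagonal test matrix [2 -1 0; -1 2 -1; 0 -1 2].
    int h_rp[] = {0, 2, 5, 7};
    int h_ci[] = {0, 1, 0, 1, 2, 1, 2};
    double h_v[] = {2, -1, -1, 2, -1, -1, 2};
    double h_x[] = {1, 1, 1}, h_y[3];
    int *rp, *ci; double *v, *x, *y;
    cudaMalloc(&rp, sizeof(h_rp)); cudaMalloc(&ci, sizeof(h_ci));
    cudaMalloc(&v, sizeof(h_v));   cudaMalloc(&x, sizeof(h_x));
    cudaMalloc(&y, sizeof(h_y));
    cudaMemcpy(rp, h_rp, sizeof(h_rp), cudaMemcpyHostToDevice);
    cudaMemcpy(ci, h_ci, sizeof(h_ci), cudaMemcpyHostToDevice);
    cudaMemcpy(v, h_v, sizeof(h_v), cudaMemcpyHostToDevice);
    cudaMemcpy(x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);
    spmv_csr<<<1, 32>>>(3, rp, ci, v, x, y);
    cudaMemcpy(h_y, y, sizeof(h_y), cudaMemcpyDeviceToHost);
    printf("y = %g %g %g\n", h_y[0], h_y[1], h_y[2]);  // expect 1 0 1
    return 0;
}
```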
Sep, 14
A GPU-based Affine and Scale Invariant Feature Transform Algorithm
Affine invariance is one of the key properties of a good feature extraction algorithm. SIFT is a scale-invariant feature extraction algorithm, but it is not affine invariant. To improve the SIFT algorithm's affine invariance, the Affine and Scale Invariant Feature Transform (ASIFT) algorithm incorporates an affine model into SIFT. However, the serial ASIFT algorithm's computing […]
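As background on how ASIFT achieves affine invariance (a generic sketch, not this GPU implementation), the method simulates a grid of camera viewpoints, runs SIFT on every simulated view, and merges the matches. The C++ snippet below enumerates a (tilt, rotation) sampling of the kind described in the original ASIFT paper; the exact parameter values are assumptions here.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// One simulated camera viewpoint: a tilt factor and an in-plane rotation.
struct View { double tilt, phi_deg; };

// Tilts grow geometrically by sqrt(2); rotations are sampled more densely at
// larger tilts. Every simulated view is an independent SIFT run, which is what
// makes the method a natural candidate for GPU parallelization.
std::vector<View> simulated_views(int max_tilt_steps = 5) {
    std::vector<View> views{{1.0, 0.0}};          // the original image
    for (int k = 1; k <= max_tilt_steps; ++k) {
        double t = std::pow(std::sqrt(2.0), k);   // tilt = sqrt(2)^k
        for (double phi = 0.0; phi < 180.0; phi += 72.0 / t)
            views.push_back({t, phi});
    }
    return views;
}

int main() {
    auto views = simulated_views();
    printf("%zu simulated views to run SIFT on\n", views.size());
    return 0;
}
```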
Sep, 14
A GPU Accelerated BiConjugate Gradient Stabilized Solver for Speeding-up Large Scale Model Evaluation
Solving linear systems remains a key activity in economics modelling, making fast and accurate solution methods highly desirable. In this paper, a proof-of-concept C++ AMP implementation of an iterative method for solving linear systems, BiConjugate Gradient Stabilized (henceforth BiCGSTAB), is presented. The method relies on matrix and vector operations, […]
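For reference, here is a plain C++ sketch of the textbook, unpreconditioned BiCGSTAB iteration; the paper offloads the matrix and vector operations via C++ AMP, whereas they are left as simple host loops below, and the 3x3 test system is made up for illustration.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Dense mat-vec and dot product: these are the operations a GPU version offloads.
Vec matvec(const std::vector<Vec>& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}
double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Textbook unpreconditioned BiCGSTAB, starting from x = 0.
Vec bicgstab(const std::vector<Vec>& A, const Vec& b,
             int max_iter = 1000, double tol = 1e-10) {
    size_t n = b.size();
    Vec x(n, 0.0), r = b, r_hat = b, p(n, 0.0), v(n, 0.0);
    double rho = 1.0, alpha = 1.0, omega = 1.0;
    for (int it = 0; it < max_iter; ++it) {
        double rho_new = dot(r_hat, r);
        double beta = (rho_new / rho) * (alpha / omega);
        rho = rho_new;
        for (size_t i = 0; i < n; ++i) p[i] = r[i] + beta * (p[i] - omega * v[i]);
        v = matvec(A, p);
        alpha = rho / dot(r_hat, v);
        Vec s(n);
        for (size_t i = 0; i < n; ++i) s[i] = r[i] - alpha * v[i];
        Vec t = matvec(A, s);
        omega = dot(t, s) / dot(t, t);
        for (size_t i = 0; i < n; ++i) x[i] += alpha * p[i] + omega * s[i];
        for (size_t i = 0; i < n; ++i) r[i] = s[i] - omega * t[i];
        if (std::sqrt(dot(r, r)) < tol) break;   // residual small enough
    }
    return x;
}

int main() {
    std::vector<Vec> A = {{4, 1, 0}, {1, 3, 1}, {0, 1, 2}};
    Vec b = {1, 2, 3};
    Vec x = bicgstab(A, b);
    printf("x = %.4f %.4f %.4f\n", x[0], x[1], x[2]);  // expect ~0.2222 0.1111 1.4444
    return 0;
}
```

Because the iteration is built entirely from mat-vec, dot products and axpy-style updates, offloading those three primitives is enough to move the whole solver to the GPU.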
Sep, 14
Efficient CUDA polynomial preconditioned Conjugate Gradient solver for Finite Element computation of elasticity problems
The Graphics Processing Unit (GPU) has achieved great success in scientific computing thanks to its tremendous computational horsepower and very high memory bandwidth. This paper discusses an efficient way to implement a polynomial preconditioned conjugate gradient solver for the finite element computation of elasticity on NVIDIA GPUs using the Compute Unified Device Architecture (CUDA). The Sliced Block ELLPACK (SBELL) format […]
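The SBELL format is a sliced, blocked refinement of ELLPACK; as a baseline illustration (not the paper's kernel), the CUDA sketch below shows SpMV over plain ELLPACK, whose column-major padding is what gives coalesced accesses in the first place.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SpMV for plain ELLPACK: every row is padded to the same number of entries and
// the arrays are stored column-major, so consecutive threads (rows) read
// consecutive addresses. SBELL slices the matrix into row blocks so each slice
// gets its own width, reducing the padding overhead.
__global__ void spmv_ell(int n_rows, int max_nnz_per_row,
                         const int* col_idx, const float* val,
                         const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    float sum = 0.0f;
    for (int k = 0; k < max_nnz_per_row; ++k) {
        int idx = k * n_rows + row;               // column-major layout
        int col = col_idx[idx];
        if (col >= 0) sum += val[idx] * x[col];   // col = -1 marks padding
    }
    y[row] = sum;
}

int main() {
    // 3x3 tridiagonal example [2 -1 0; -1 2 -1; 0 -1 2] in column-major ELL.
    int   h_col[] = {0, 0, 1,   1, 1, 2,  -1, 2, -1};
    float h_val[] = {2,-1,-1,  -1, 2, 2,   0,-1,  0};
    float h_x[] = {1, 1, 1}, h_y[3];
    int *col; float *val, *x, *y;
    cudaMalloc(&col, sizeof(h_col)); cudaMalloc(&val, sizeof(h_val));
    cudaMalloc(&x, sizeof(h_x));     cudaMalloc(&y, sizeof(h_y));
    cudaMemcpy(col, h_col, sizeof(h_col), cudaMemcpyHostToDevice);
    cudaMemcpy(val, h_val, sizeof(h_val), cudaMemcpyHostToDevice);
    cudaMemcpy(x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);
    spmv_ell<<<1, 32>>>(3, 3, col, val, x, y);
    cudaMemcpy(h_y, y, sizeof(h_y), cudaMemcpyDeviceToHost);
    printf("y = %g %g %g\n", h_y[0], h_y[1], h_y[2]);  // expect 1 0 1
    return 0;
}
```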
Sep, 13
FuzzyGPU: a fuzzy arithmetic library for GPU
Data are traditionally represented using native formats such as integers or floating-point numbers in their various flavors. However, some applications rely on more complex representation formats; this is the case when uncertainty needs to be represented. Fuzzy arithmetic is one of the major tools to address this problem, but the execution time of basic operations such […]
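A common way to encode fuzzy values is as triangular fuzzy numbers (lower bound, peak, upper bound); addition then acts componentwise. The CUDA sketch below illustrates that operation with a made-up `Fuzzy` struct; it is not FuzzyGPU's actual API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A triangular fuzzy number: 'lo' and 'hi' bound the support, 'peak' has
// membership 1. This is a generic representation, not FuzzyGPU's types.
struct Fuzzy { float lo, peak, hi; };

// Element-wise fuzzy addition: each component adds independently.
// (Multiplication is costlier: it needs min/max over the endpoint products.)
__global__ void fuzzy_add(const Fuzzy* a, const Fuzzy* b, Fuzzy* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = { a[i].lo + b[i].lo, a[i].peak + b[i].peak, a[i].hi + b[i].hi };
}

int main() {
    Fuzzy h_a[] = {{0.9f, 1.0f, 1.1f}}, h_b[] = {{1.8f, 2.0f, 2.3f}}, h_c[1];
    Fuzzy *a, *b, *c;
    cudaMalloc(&a, sizeof(h_a)); cudaMalloc(&b, sizeof(h_b)); cudaMalloc(&c, sizeof(h_c));
    cudaMemcpy(a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);
    fuzzy_add<<<1, 32>>>(a, b, c, 1);
    cudaMemcpy(h_c, c, sizeof(h_c), cudaMemcpyDeviceToHost);
    printf("(%g, %g, %g)\n", h_c[0].lo, h_c[0].peak, h_c[0].hi);  // (2.7, 3, 3.4)
    return 0;
}
```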
Sep, 13
Increasing GPU Throughput using Kernel Interleaved Thread Block Scheduling
The number of active threads required to achieve peak application throughput on graphics processing units (GPUs) depends largely on the ratio of time spent on computation to the time spent accessing data from memory. While compute-intensive applications can achieve peak throughput with a low number of threads, memory-intensive applications might not achieve good throughput even […]
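The paper targets the hardware thread block scheduler, which software cannot change directly; the closest software-level analogue available today is launching a memory-bound and a compute-bound kernel in separate CUDA streams so their thread blocks can coexist on the SMs. The sketch below shows that setup; the kernel bodies and sizes are arbitrary, and it is only an analogy to the proposed scheduling, not the paper's mechanism.

```cuda
#include <cuda_runtime.h>

// Memory-bound kernel: mostly waits on global memory.
__global__ void mem_bound(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Compute-bound kernel: many arithmetic operations per byte of memory traffic.
__global__ void compute_bound(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 1.000001f + 0.5f;
    data[i] = v;
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    dim3 block(256), grid((n + 255) / 256);
    // Different streams allow the thread blocks of the two kernels to overlap
    // on the SMs, letting arithmetic hide some of the memory latency.
    mem_bound<<<grid, block, 0, s1>>>(a, b, n);
    compute_bound<<<grid, block, 0, s2>>>(c, n, 4096);
    cudaDeviceSynchronize();
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```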
Sep, 13
An Interface for Halo Exchange Pattern
Halo exchange patterns are very common in scientific computing, since the solution of PDEs often requires communication between neighboring points. Although this is a common pattern, implementations are often written by programmers from scratch, with an accompanying feeling of "reinventing the wheel". In this paper we describe GCL, a C++ generic library that implements a […]
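For readers unfamiliar with the pattern, a halo exchange packs the outermost interior cells of each subdomain into send buffers, exchanges them with neighboring ranks, and unpacks the received data into ghost cells. The C++ sketch below shows the pack/unpack step for the north/south edges of a 2D grid with a one-cell halo; GCL generalizes and automates this for arbitrary halo widths and layouts, the communication itself (e.g. via MPI) is omitted, and all names here are illustrative.

```cpp
#include <cstdio>
#include <vector>

// A local (ny x nx) grid with one ghost cell on each side, so the allocated
// array is (ny + 2) x (nx + 2), stored row-major.
struct Grid {
    int nx, ny;
    std::vector<double> data;
    double& at(int j, int i) { return data[(size_t)j * (nx + 2) + i]; }
};

// Copy the first and last interior rows into send buffers for the north and
// south neighbors.
void pack_north_south(Grid& g, std::vector<double>& to_north,
                      std::vector<double>& to_south) {
    to_north.resize(g.nx);
    to_south.resize(g.nx);
    for (int i = 0; i < g.nx; ++i) {
        to_north[i] = g.at(1, i + 1);         // first interior row
        to_south[i] = g.at(g.ny, i + 1);      // last interior row
    }
}

// Write the data received from the neighbors into the ghost rows.
void unpack_north_south(Grid& g, const std::vector<double>& from_north,
                        const std::vector<double>& from_south) {
    for (int i = 0; i < g.nx; ++i) {
        g.at(0, i + 1)        = from_north[i];   // top ghost row
        g.at(g.ny + 1, i + 1) = from_south[i];   // bottom ghost row
    }
}

int main() {
    Grid g{4, 4, std::vector<double>(6 * 6, 1.0)};
    std::vector<double> sn, ss;
    pack_north_south(g, sn, ss);
    // ...exchange sn/ss with the neighboring ranks here, then:
    unpack_north_south(g, sn, ss);
    printf("packed %zu values per edge\n", sn.size());
    return 0;
}
```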

