high performance computing on graphics processing units: hgpu.org

Posts

Feb, 22

Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs

In this paper, we present an approach to estimate GPU applications’ performance upper bound based on algorithm analysis and assembly code level benchmarking. As an example, we analyze the potential peak performance of SGEMM (Single-precision General Matrix Multiply) on Fermi (GF110) and Kepler (GK104) GPUs. We try to answer the question of how much optimization […]

CUDA

Feb, 22

An efficient scheduling scheme using estimated execution time for heterogeneous computing systems

Computing systems should be designed to exploit parallelism in order to improve performance. In general, a GPU (Graphics Processing Unit) can provide more parallelism than a CPU (Central Processing Unit), resulting in the wide usage of heterogeneous computing systems that utilize both the CPU and the GPU together. In heterogeneous computing systems, the efficiency […]

CUDA

Feb, 22

A Parallel Active-Set Method for Solving Frictional Contact Problems

Simulating frictional contact is a challenging computational task and there exist a variety of techniques to do so. One such technique, the staggered projections algorithm, requires the solution of two convex quadratic program (QP) subproblems at each iteration. We introduce a method, SCHURPA, which employs a primal-dual active-set strategy to efficiently solve these QPs based […]

CUDA

Feb, 22

Investigating performance variations of an optimized GPU-ported granulometry algorithm

In this article, we present an optimized GPU implementation of a granulometry algorithm that is widely used in the study of materials. The main contribution to this algorithm is the binarization of the input data, which increases throughput while reducing the memory allocated for data. Also, the optimized GPU implementation brings an order of […]

CUDA

Feb, 22

GPU-based Motion Planning under Uncertainties using POMDP

We present a novel GPU-based parallel algorithm to solve continuous-state POMDP problems. We choose the MCVI (Monte Carlo Value Iteration) method as our base algorithm [1], and parallelize this algorithm using a multi-level parallel formulation of MCVI. For each parallel level, we propose efficient algorithms to effectively utilize the massive data parallelism of GPUs. To obtain […]

CUDA

Feb, 21

Scheduling a Parallel Sparse Direct Solver to Multiple GPUs

We present a sparse direct solver using multilevel task scheduling on a modern heterogeneous compute node consisting of a multi-core host processor and multiple GPU accelerators. Our direct solver is based on the multifrontal method, which is characterized by exploiting dense subproblems (fronts) related in an assembly tree. Critical to high performance of the solver […]

CUDA

Feb, 21

Chrono: a parallel multi-physics library for rigid-body, flexible-body, and fluid dynamics

The last decade witnessed a manifest shift in the microprocessor industry towards chip designs that promote parallel computing. Until recently the privilege of a select group of large research centers, Teraflop computing is becoming a commodity owing to inexpensive GPU cards and multi- to many-core x86 processors. This paradigm shift towards large-scale parallel computing […]

CUDA

Feb, 21

Computation of Air-Vortices Based on GPU Technology: Optimizing and Parallelizing a Model for Wake-Vortex Prediction Using OpenCL

This thesis details the refinement and numerical solution of a preexisting model for predicting the strengths and positions of so-called wake-vortices that are generated from the lift of heavy aircraft. The ultimate objective is to implement a numerical scheme for the model that is fast enough to allow for probabilistic methods, such as Monte Carlo simulations, […]

OpenCL

Feb, 21

GamePipe: A Virtualized Cloud Platform Design and Performance Evaluation

Cloud gaming provides game-on-demand (GoD) services over the Internet cloud. The goal is to achieve faster response time and higher QoS. The video game is rendered remotely on the game cloud and decoded on thin client devices such as tablet computers or smartphones. We design a game cloud with a virtualized cluster of CPU/GPU servers […]

CUDA

Feb, 21

Ray Tracing on GPUs

The ray tracing method aims to produce realistic, high-quality images of a scene described by geometric primitives such as triangles, spheres, etc. The basic idea is quite simple and allows for straightforward implementations of this technique on the computer. At its core is a set of rays, each of which corresponds to one […]

CUDA

Feb, 20

Complexity Analysis and Algorithm Design for Reorganizing Data to Minimize Non-Coalesced Memory Accesses on GPU

The performance of Graphics Processing Units (GPUs) is sensitive to irregular memory references. Some recent work shows the promise of data reorganization for eliminating non-coalesced memory accesses that are caused by irregular references. However, all previous studies have employed simple, heuristic methods to determine the new data layouts to create. As a result, they either […]

CUDA

Feb, 20

An abstract object oriented runtime system for heterogeneous parallel architecture

In our paper we present an abstract object oriented runtime system that helps to develop scientific applications for new heterogeneous architectures based on multiple nodes of multi-core processors enhanced with accelerator boards. Its architecture, based on abstract concepts, makes it possible to follow hardware technology by extending these concepts with new implementations modeling new hardware components, while limiting the […]

CUDA • OpenCL

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
