Posts
Jun, 30
Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU
GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism – such as flat or two-level parallelism – and a degree of parallelism that can be statically determined based on the size of the input dataset. However, the effective use of GPUs for algorithms exhibiting complex patterns of parallelism, possibly known only […]
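As a rough illustration of the nested launches this line of work targets, below is a minimal CUDA dynamic-parallelism sketch in which each parent thread launches a child grid sized by data discovered at run time; the names and sizes are illustrative and this is not the paper's consolidation scheme, which aims to reduce exactly this kind of per-thread launch overhead.

```
// Illustrative only: nested (dynamic) parallelism in CUDA.
// Compile with: nvcc -arch=sm_35 -rdc=true example.cu
#include <cstdio>

__global__ void child(const int *items, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        // Per-item work whose amount is known only at run time.
        printf("item %d\n", items[i]);
    }
}

__global__ void parent(const int *offsets, const int *items, int numLists) {
    int l = blockIdx.x * blockDim.x + threadIdx.x;
    if (l < numLists) {
        int begin = offsets[l], count = offsets[l + 1] - begin;
        if (count > 0) {
            // One child launch per list: simple, but many tiny grids like
            // this are the overhead that workload consolidation targets.
            child<<<(count + 127) / 128, 128>>>(items + begin, count);
        }
    }
}
```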
Jun, 30
Persistent RNNs: Stashing Recurrent Weights On-Chip
This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit […]
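A minimal sketch of the persistent-kernel idea, not the paper's implementation: the recurrent weights are staged on-chip once and reused across all timesteps inside a single kernel, instead of being re-read from DRAM for every step's matrix multiply. It assumes a tiny hidden size so the weights fit in shared memory, and a single block of at least H threads.

```
#define H 64          // hidden units (illustrative)
#define T 128         // timesteps (illustrative)

__global__ void persistent_rnn(const float *W,   // H x H recurrent weights
                               const float *x,   // T x H pre-projected inputs
                               float *h)         // H hidden state (in/out)
{
    __shared__ float Ws[H][H];
    __shared__ float hs[H];

    // Stage weights and initial state on-chip once.
    for (int i = threadIdx.x; i < H * H; i += blockDim.x)
        Ws[i / H][i % H] = W[i];
    if (threadIdx.x < H) hs[threadIdx.x] = h[threadIdx.x];
    __syncthreads();

    // All timesteps run inside one kernel; the weights never leave the chip.
    for (int t = 0; t < T; ++t) {
        float acc = 0.0f;
        if (threadIdx.x < H) {
            for (int j = 0; j < H; ++j)
                acc = fmaf(Ws[threadIdx.x][j], hs[j], acc);
            acc = tanhf(acc + x[t * H + threadIdx.x]);
        }
        __syncthreads();
        if (threadIdx.x < H) hs[threadIdx.x] = acc;
        __syncthreads();
    }
    if (threadIdx.x < H) h[threadIdx.x] = hs[threadIdx.x];
}
```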
Jun, 28
Parallel and Distributed Deep Learning
The goal of this report is to explore ways to parallelize and distribute deep learning in multi-core and distributed settings. We empirically analyze the speedup in training a CNN using a conventional single-core CPU and a GPU, and provide practical suggestions to improve training times. In the distributed setting, we study and analyze synchronous and asynchronous weight […]
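To make the distinction between the two update schemes concrete, here is a hedged CUDA sketch (hypothetical names, not the report's code): a synchronous step averages all workers' gradients before one SGD update, while an asynchronous step lets each worker apply its own gradient immediately via atomics, accepting stale reads.

```
// Synchronous: average W workers' gradients, then one SGD step.
__global__ void sync_update(float *weights, const float *grads,
                            int nParams, int nWorkers, float lr)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nParams) {
        float g = 0.0f;
        for (int w = 0; w < nWorkers; ++w)
            g += grads[w * nParams + i];
        weights[i] -= lr * (g / nWorkers);
    }
}

// Asynchronous: each worker applies its gradient as soon as it is ready;
// atomics keep the update well-defined at the cost of staleness.
__global__ void async_update(float *weights, const float *grad,
                             int nParams, float lr)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nParams)
        atomicAdd(&weights[i], -lr * grad[i]);
}
```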
Jun, 28
A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves
The sparse triangular solve kernel, SpTRSV, is an important building block for a number of numerical linear algebra routines. Parallelizing SpTRSV on today’s manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, […]
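The dependency the abstract mentions is easiest to see in the plain serial forward substitution for a lower-triangular matrix in CSR form, sketched below; each x[i] needs already-computed earlier components, which is what a parallel (and, in this paper, synchronization-free) formulation must work around. This is illustrative only, not the paper's GPU algorithm.

```
// Solves L x = b for lower-triangular L in CSR, assuming the last entry
// of each row is the diagonal.
void sptrsv_serial(int n, const int *rowPtr, const int *colIdx,
                   const double *val, const double *b, double *x)
{
    for (int i = 0; i < n; ++i) {
        double sum = b[i];
        int end = rowPtr[i + 1] - 1;          // last entry = diagonal
        for (int p = rowPtr[i]; p < end; ++p)
            sum -= val[p] * x[colIdx[p]];     // needs x[j] for j < i
        x[i] = sum / val[end];
    }
}
```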
Jun, 28
Parallelizing Map Projection of Raster Data on Multi-core CPU and GPU Parallel Programming Frameworks
Map projections lie at the core of geographic information systems and numerous projections are used today. Reprojection between different map projections is a recurring operation in a geographic information system, and it can be parallelized with multi-core CPUs and GPUs. This thesis implements a parallel analytic reprojection algorithm of raster data in C/C++ with the parallel […]
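Reprojection parallelizes naturally because each output cell can be resampled independently. Below is a minimal CUDA sketch of that mapping, assuming for concreteness a spherical Web Mercator target and a plain lon/lat source grid with nearest-neighbour sampling; it is illustrative, not the thesis's implementation.

```
struct Grid { int width, height; double x0, y0, dx, dy; };  // origin + cell size

__global__ void reproject(const float *src, Grid sg, float *dst, Grid dg)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= dg.width || row >= dg.height) return;

    // Centre of this output cell in Mercator metres.
    double x = dg.x0 + (col + 0.5) * dg.dx;
    double y = dg.y0 + (row + 0.5) * dg.dy;

    // Inverse Mercator -> geographic coordinates in degrees.
    const double R = 6378137.0, PI = 3.14159265358979323846;
    double lon = x / R * 180.0 / PI;
    double lat = (2.0 * atan(exp(y / R)) - PI / 2.0) * 180.0 / PI;

    // Nearest-neighbour lookup in the lon/lat source grid.
    int sc = (int)((lon - sg.x0) / sg.dx);
    int sr = (int)((lat - sg.y0) / sg.dy);
    dst[row * dg.width + col] =
        (sc >= 0 && sc < sg.width && sr >= 0 && sr < sg.height)
            ? src[sr * sg.width + sc] : 0.0f;
}
```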
Jun, 28
Accelerating High-Throughput Computing through OpenCL
As the computational trend diverges from standard CPU computing to encompass GPUs and other accelerators, the need to integrate these otherwise unused resources into existing systems becomes apparent. This paper presents the implementation of an HTCondor pool with GPU execution capabilities through OpenCL. The implementation is discussed from both the system setup and the software design standpoints. […]
Jun, 28
GPU Based Real-Time Welding Simulation with Smoothed-Particle Hydrodynamics
Welding training is essential in the development of industrialization. A good welder will build robust workpieces that ensure the safety and stability of the product. However, training a welder requires a great deal of time and access to professional welding equipment. Therefore, it is desirable to have a training system that is economical and easy to use. After […]
Jun, 22
Efficient and High-quality Sparse Graph Coloring on the GPU
Graph coloring has been broadly used to discover concurrency in parallel computing. To speed up graph coloring for large-scale datasets, parallel algorithms have been proposed to leverage modern GPUs. Existing GPU implementations either have limited performance or yield unsatisfactory coloring quality (too many colors assigned). We present a work-efficient parallel graph coloring implementation on GPUs with […]
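A common baseline that GPU graph-coloring papers compare against is speculative colouring with conflict resolution, sketched below for a CSR adjacency structure; it is shown only to illustrate the general approach, and the paper's work-efficient algorithm differs.

```
#define MAX_COLORS 128   // illustrative bound on colors considered per vertex

// Each active vertex picks the smallest color not used by its neighbours.
__global__ void speculative_color(int n, const int *rowPtr, const int *colIdx,
                                  int *color, const int *active)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || !active[v]) return;

    bool used[MAX_COLORS] = {false};
    for (int p = rowPtr[v]; p < rowPtr[v + 1]; ++p) {
        int c = color[colIdx[p]];
        if (c >= 0 && c < MAX_COLORS) used[c] = true;
    }
    int c = 0;
    while (used[c]) ++c;
    color[v] = c;
}

// Adjacent vertices that chose the same color conflict; the lower id recolors.
__global__ void detect_conflicts(int n, const int *rowPtr, const int *colIdx,
                                 const int *color, int *active, int *anyConflict)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    active[v] = 0;
    for (int p = rowPtr[v]; p < rowPtr[v + 1]; ++p) {
        int u = colIdx[p];
        if (u != v && color[u] == color[v] && v < u) {
            active[v] = 1;
            *anyConflict = 1;
            break;
        }
    }
}
// Host loop: mark all vertices active, then alternate the two kernels until
// detect_conflicts reports no remaining conflict.
```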
Jun, 22
Efficient and portable multi-tasking for heterogeneous systems
Modern computing systems comprise heterogeneous designs which combine multiple and diverse architectures on a single system. These designs offer the potential for high performance under reduced power requirements, but they require advanced resource management and workload scheduling across the available processors. Programmability frameworks, such as OpenCL and CUDA, enable resource management and workload scheduling on heterogeneous systems. […]
Jun, 22
Tensor Contractions with Extended BLAS Kernels on CPU and GPU
Tensor contractions constitute a key computational ingredient of numerical multi-linear algebra. However, as the order and dimension of tensors grow, the time and space complexities of tensor-based computations grow quickly. Existing approaches for tensor contractions typically involve explicit copy and transpose operations. In this paper, we propose and evaluate a new BLAS-like primitive STRIDEDBATCHEDGEMM that […]
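The core idea can be sketched with cuBLAS's strided batched GEMM: a mode-wise contraction such as C[m,n,p] = sum_k A[m,k,p] * B[k,n,p] becomes a single batched call over the p-slices, with no explicit copies or transposes. The paper proposes the STRIDEDBATCHEDGEMM primitive itself; the snippet below only shows the mapping, assuming column-major slices with the last mode outermost.

```
#include <cublas_v2.h>

void contract(cublasHandle_t h, int m, int n, int k, int p,
              const float *A, const float *B, float *C)
{
    const float alpha = 1.0f, beta = 0.0f;
    // One GEMM per p-slice, batched over contiguous slices.
    cublasSgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N,
                              m, n, k,
                              &alpha,
                              A, m, (long long)m * k,   // lda, strideA
                              B, k, (long long)k * n,   // ldb, strideB
                              &beta,
                              C, m, (long long)m * n,   // ldc, strideC
                              p);                       // batch count
}
```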
Jun, 22
CNNLab: a Novel Parallel Framework for Neural Networks using GPU and FPGA-a Practical Study with Trade-off Analysis
Designing and implementing efficient, provably correct parallel neural network processing is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive, while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. However, the diversity and large scale of the data have posed a significant challenge to constructing a flexible and high-performance […]
Jun, 22
Soft GPGPUs for Embedded FPGAs: An Architectural Evaluation
We present a customizable soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. Issues related to scaling the overlay architecture to multiple GPGPU multiprocessors are considered along with application-class architectural optimizations. The overlay architecture is optimized for FPGA implementation to support efficient use of […]