high performance computing on graphics processing units: hgpu.org

Posts

Dec, 19

Accelerated Sparse Matrix Operations in Nonlinear Least Squares Solvers

This thesis focuses on data structures for sparse block matrices and the associated algorithms for performing linear algebra operations that I have developed. Sparse block matrices occur naturally in many key problems, such as Nonlinear LEast Squares (NLS) on graphical models. NLS are used by e.g. Simultaneous Localization and Mapping (SLAM) in robotics, Bundle Adjustment […]

OpenCL

Dec, 19

OpenCL-accelerated Point Feature Histogram and Its Application in Railway Track Point Cloud Data Processing

To meet the requirements of railway track point cloud processing, an OpenCL-accelerated Point Feature Histogram method is proposed using heterogeneous computing to improve the low computation efficiency. According to the characteristics of parallel computing of OpenCL, the data structure for point cloud storage is reconfigured. With the kernel performance analysis by CodeXL, the data reading […]

OpenCL

Dec, 19

Tactics to Directly Map CNN graphs on Embedded FPGAs

Deep Convolutional Neural Networks (CNNs) are the state-of-the-art in image classification. Since CNN feed forward propagation involves highly regular parallel computation, it benefits from a significant speed-up when running on fine grain parallel programmable logic devices. As a consequence, several studies have proposed FPGA-based accelerators for CNNs. However, because of the large computational power required […]

Dec, 19

Improving 3D Lattice Boltzmann Method stencil with asynchronous transfers on many-core processors

CPU-based many-core processors present an alternative to multicore CPU and GPU processors. In particular, the 93-Petaflops Sunway supercomputer, built from clustered many-core processors, has opened a new era for high performance computing that does not rely on GPU acceleration. However, memory bandwidth remains the main challenge for these architectures. This motivates our endeavor for optimizing […]

OpenCL

Dec, 15

POMPEI: Programming with OpenMP4 for Exascale Investigations

The objective of the Programming with OpenMP4 for Exascale Investigations (POMPEI) project is to explore new task-based programming techniques together with data structure centric programming for scientific applications to harness the potential of extreme-scale systems. Tasking is a well established by now approach on such systems as it has been used successfully to handle their […]

Dec, 15

Effective Extensible Programming: Unleashing Julia on GPUs

GPUs and other accelerators are popular devices for accelerating compute-intensive, parallelizable applications. However, programming these devices is a difficult task. Writing efficient device code is challenging, and is typically done in a low-level programming language. High-level languages are rarely supported, or do not integrate with the rest of the high-level language ecosystem. To overcome this, […]

CUDA

Dec, 15

Intra-node Memory Safe GPU Co-Scheduling

GPUs in High-Performance Computing systems remain under-utilised due to the unavailability of schedulers that can safely schedule multiple applications to share the same GPU. The research reported in this paper is motivated to improve the utilisation of GPUs by proposing a framework, we refer to as schedGPU, to facilitate intra-node GPU co-scheduling such that a […]

CUDA

Dec, 15

Task Scheduling for Heterogeneous Multicore Systems

In recent years, as the demand for low energy and high performance computing has steadily increased, heterogeneous computing has emerged as an important and promising solution. Because most workloads can typically run most efficiently on certain types of cores, mapping tasks on the best available resources can not only save energy but also deliver high […]

OpenCL

Dec, 14

A Survey of Techniques for Architecting SLC/MLC/TLC Hybrid Flash Memory based SSDs

Flash memory based SSDs offer several attractive features and benefits compared to hard disk drive (HDD), such as shock resistance, better performance especially for random data access, etc. Depending on the number of bits in each cell, Flash memory can be designed as single/multi/triple level cell (SLC/MLC/TLC) which have different performance, density, cost and write […]

Dec, 10

Acceleration of Cellular Automata through Parallel Computing with OpenCL

Cellular Automata (CA) have its origins in the work of Von Neumann and, since then, have become an important research topic with a wide range of applications, ranging from DNA sequencing to ecological dynamics. One aspect that may be of interest during a CA simulation is the evolution in the number of individuals of each […]

OpenCL

Dec, 10

On algorithmic reductions in task-parallel programming models

Wide adoption of parallel processing hardware in mainstream computing as well as the interest for efficient parallel programming in developer communities increase the demand for programming models that offer support for common algorithmic patterns. An algorithmic pattern of particular interest are reductions. Reductions are iterative memory updates of a program variable and appear in many […]

Dec, 10

Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers

The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is in the high performance that can […]

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

GigaAPI for GPU Parallelization

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

Advances in Semantic Patching for HPC-oriented Refactorings with Coccinelle

DuoReduce: MLIR's benchmark

Hardware-Assisted Software Testing and Debugging for Heterogeneous Computing

See all packages

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Accelerated Sparse Matrix Operations in Nonlinear Least Squares Solvers

OpenCL-accelerated Point Feature Histogram and Its Application in Railway Track Point Cloud Data Processing

Tactics to Directly Map CNN graphs on Embedded FPGAs

Improving 3D Lattice Boltzmann Method stencil with asynchronous transfers on many-core processors

POMPEI: Programming with OpenMP4 for Exascale Investigations

Effective Extensible Programming: Unleashing Julia on GPUs

Intra-node Memory Safe GPU Co-Scheduling

Task Scheduling for Heterogeneous Multicore Systems

A Survey of Techniques for Architecting SLC/MLC/TLC Hybrid Flash Memory based SSDs

Acceleration of Cellular Automata through Parallel Computing with OpenCL

On algorithmic reductions in task-parallel programming models

Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers

Recent source codes

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

DuoReduce: MLIR's benchmark

Most viewed papers (last 30 days)