Posts
Oct, 4
Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
Sparse general matrix-matrix multiplication (spGEMM) is an essential component in many scientific and data analytics applications. However, the sparsity patterns of the input matrices and their interaction make spGEMM challenging. Modern GPUs include Tensor Core Units (TCUs), which specialize in dense matrix multiplication. Our aim is to re-purpose TCUs for sparse matrices. […]
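The core idea behind re-purposing dense matrix units for spGEMM can be sketched in a few lines of NumPy: store only the nonzero tiles of each matrix and feed every matching tile pair to a dense multiply, which is the operation a TCU accelerates. This is an illustrative sketch with an assumed tile size, not the paper's implementation:

```python
import numpy as np

TILE = 4  # tile edge; real TCUs operate on fixed fragments, e.g. 16x16

def to_tiles(A):
    """Map a square matrix to {(tile_row, tile_col): dense tile}, keeping nonzero tiles only."""
    n = A.shape[0] // TILE
    tiles = {}
    for i in range(n):
        for j in range(n):
            t = A[i*TILE:(i+1)*TILE, j*TILE:(j+1)*TILE]
            if np.any(t):
                tiles[(i, j)] = t
    return tiles

def tiled_spgemm(A, B):
    """C = A @ B computed only over pairs of nonzero tiles.
    Each `ta @ tb` is the dense tile product a tensor-core unit would execute."""
    n = A.shape[0] // TILE
    ta_map, tb_map = to_tiles(A), to_tiles(B)
    C = np.zeros_like(A, dtype=float)
    for (i, k), ta in ta_map.items():
        for j in range(n):
            tb = tb_map.get((k, j))
            if tb is not None:
                C[i*TILE:(i+1)*TILE, j*TILE:(j+1)*TILE] += ta @ tb
    return C
```

Only tile pairs where both operands are nonzero contribute work, so the dense units are kept busy with useful products; a real implementation must also match the tile shape to the TCU fragment size and merge partial products efficiently.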
Sep, 27
Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters
Graphics processing units (GPUs) offer strong floating-point capability and high memory bandwidth for data-parallel workloads and have been widely used in high-performance computing (HPC). The Compute Unified Device Architecture (CUDA) serves as a parallel computing platform and programming model that reduces the complexity of GPU programming. The programmable GPUs are becoming […]
Sep, 27
Extending High-Level Synthesis for Task-Parallel Programs
C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycle compared with the traditional register-transfer level (RTL) design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt […]
Sep, 27
Performance Evaluation of Mixed Precision Algorithms for Solving Sparse Linear Systems
It is well established that mixed precision algorithms that factorize a matrix at a precision lower than the working precision can reduce the execution time and the energy consumption of parallel solvers for dense linear systems. Much less is known about the efficiency of mixed precision parallel algorithms for sparse linear systems, and existing work […]
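The technique the abstract describes can be illustrated with classic mixed-precision iterative refinement: solve in float32, then refine the residual in the float64 working precision. A minimal dense NumPy sketch; a real sparse solver would reuse a cached low-precision factorization rather than calling `solve` repeatedly:

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b using a low-precision (float32) solve as the expensive step,
    with residual computation and correction done in float64."""
    A32 = A.astype(np.float32)           # stand-in for a low-precision factorization
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                    # residual in the working precision
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d                           # correction applied in float64
    return x
```

For a well-conditioned system the refinement loop recovers full double-precision accuracy even though the factorization-like step ran in single precision, which is where the time and energy savings come from.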
Sep, 27
Adaptation of High Performance and High Capacity Reconfigurable Systems to OpenCL Programming Environments
In this work, we adapt a reconfigurable computer system based on FPGA technologies to OpenCL programming environments. The reconfigurable system is part of a compute prototype of the MANGO European project that includes 96 FPGAs. To optimize its use and obtain maximum performance, it is essential to adapt it to heterogeneous systems programming […]
Sep, 27
RoadRunner: a fast and flexible exoplanet transit model
I present RoadRunner, a fast exoplanet transit model that can use any radially symmetric function to model stellar limb darkening while still being faster to evaluate than the analytical transit model for quadratic limb darkening by Mandel & Agol (2002). CPU and GPU implementations of the model are available in the PyTransit transit modelling package, […]
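For context, the quadratic limb-darkening law used in the Mandel & Agol (2002) model writes the normalized stellar intensity as I(μ) = 1 − u1(1 − μ) − u2(1 − μ)², where μ is the cosine of the foreshortening angle. A minimal sketch (the coefficient values are illustrative):

```python
import numpy as np

def quadratic_limb_darkening(mu, u1=0.4, u2=0.2):
    """Normalized intensity I(mu)/I(1) under the quadratic limb-darkening law.
    mu = cos(theta) runs from 1 at the disc centre to 0 at the limb."""
    mu = np.asarray(mu, dtype=float)
    return 1.0 - u1 * (1.0 - mu) - u2 * (1.0 - mu) ** 2
```

RoadRunner's point is that any radially symmetric profile like this one can be plugged in, with the quadratic law being only the most common special case.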
Sep, 20
Applications of Deep Neural Networks
Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to create neural networks that can handle tabular data, images, text, and audio as both input and output. Deep learning allows a neural network to learn hierarchies […]
Sep, 20
Designing Efficient Barriers and Semaphores for Graphics Processing Units
General-purpose GPU applications that use fine-grained synchronization to enforce ordering between many threads accessing shared data have become increasingly popular. Thus, it is imperative to create more efficient GPU synchronization primitives for these applications. Accordingly, in recent years there has been a push to establish a single, unified set of GPU synchronization primitives. However, unlike […]
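A canonical primitive in this design space is the centralized sense-reversing barrier. The sketch below uses Python threads for clarity; a GPU version would replace the lock and condition variable with atomic operations and memory fences:

```python
import threading

class SenseBarrier:
    """Centralized sense-reversing barrier: reusable across phases because
    each episode flips a shared 'sense' flag instead of resetting waiters."""
    def __init__(self, n_threads):
        self.n_threads = n_threads
        self.count = 0
        self.sense = False
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            local_sense = not self.sense   # the sense this episode flips to
            self.count += 1
            if self.count == self.n_threads:
                self.count = 0             # last arrival releases everyone
                self.sense = local_sense
                self.cond.notify_all()
            else:
                while self.sense != local_sense:
                    self.cond.wait()
```

On a GPU the same structure becomes an atomic arrival counter plus a spin on the sense flag, probed by one thread per block.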
Sep, 20
A Comparison of Optimal Scanline Voxelization Algorithms
This thesis presents a comparison between different algorithms for optimal scanline voxelization of 3D models. As the optimal scanline relies on line voxelization, three such algorithms were evaluated. These were Real Line Voxelization (RLV), Integer Line Voxelization (ILV) and a 3D Bresenham line drawing algorithm. RLV and ILV were both based on voxel traversal by […]
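Of the three, the Bresenham variant is the easiest to sketch. Below is a compact driving-axis implementation of 3D Bresenham line voxelization in Python, written independently of the thesis code:

```python
def bresenham3d(p0, p1):
    """Voxelize the segment p0-p1 with 3D Bresenham: step one voxel at a
    time along the driving (longest) axis; error terms advance the others."""
    p = list(p0)
    d = [abs(b - a) for a, b in zip(p0, p1)]          # per-axis extents
    s = [1 if b >= a else -1 for a, b in zip(p0, p1)]  # per-axis step signs
    axis = d.index(max(d))                             # driving axis
    others = [i for i in range(3) if i != axis]
    err = [2 * d[i] - d[axis] for i in others]
    voxels = [tuple(p)]
    for _ in range(d[axis]):
        p[axis] += s[axis]
        for k, i in enumerate(others):
            if err[k] > 0:                 # secondary axis overdue: step it
                p[i] += s[i]
                err[k] -= 2 * d[axis]
            err[k] += 2 * d[i]
        voxels.append(tuple(p))
    return voxels
```

Because the driving axis is stepped exactly once per voxel, the result is a 26-connected line of max(|Δx|, |Δy|, |Δz|) + 1 voxels.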
Sep, 20
WarpCore: A Library for fast Hash Tables on GPUs
Hash tables are ubiquitous. Properties such as an amortized constant time complexity for insertion and querying as well as a compact memory layout make them versatile associative data structures with manifold applications. The rapidly growing amount of data emerging in many fields motivated the need for accelerated hash tables designed for modern parallel architectures. In […]
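The compact memory layout the abstract refers to is what makes hash tables GPU-friendly: open addressing keeps keys and values in flat arrays that threads can probe with coalesced accesses. A minimal CPU-side sketch of linear probing (illustrative only, not WarpCore's API):

```python
EMPTY = None  # sentinel marking an unused slot

class OpenAddressingTable:
    """Flat open-addressing hash table with linear probing. The contiguous
    slot arrays mirror the layout GPU hash tables use for coalesced probes."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.keys = [EMPTY] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        i = hash(key) % self.capacity
        for _ in range(self.capacity):
            yield i
            i = (i + 1) % self.capacity    # linear probing step

    def insert(self, key, value):
        for i in self._probe(key):
            if self.keys[i] is EMPTY or self.keys[i] == key:
                self.keys[i], self.values[i] = key, value
                return True
        return False                       # table full

    def query(self, key):
        for i in self._probe(key):
            if self.keys[i] is EMPTY:      # hit an empty slot: key absent
                return None
            if self.keys[i] == key:
                return self.values[i]
        return None
```

On a GPU, the probe loop is typically cooperated on by a warp (hence the library's name), with several slots inspected per memory transaction.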
Sep, 20
PySchedCL: Leveraging Concurrency in Heterogeneous Data-Parallel Systems
In the past decade, the high-performance compute capabilities of heterogeneous GPGPU platforms have led to the popularity of data-parallel programming languages such as CUDA and OpenCL. Such languages, however, involve a steep learning curve and require an extensive understanding of the underlying architecture of the compute devices in heterogeneous platforms. This […]
Sep, 13
Tools for GPU Computing–Debugging and Performance Analysis of Heterogeneous HPC Applications
General-purpose GPUs are now ubiquitous in high-end supercomputing. All of the announced (pre-)exascale systems except one (the Japanese Fugaku system, which is based on ARM processors) contain vast numbers of GPUs that deliver the majority of the performance of these systems. Thus, GPU programming will be a necessity for application developers using high-end HPC […]