Posts
Feb, 6
Dr.Jit: A Just-In-Time Compiler for Differentiable Rendering
We present Dr.Jit, a domain-specific just-in-time compiler for physically based rendering and its derivative. Dr.Jit traces high-level programs (e.g., written in Python) and compiles them into efficient CPU or GPU megakernels. It achieves state-of-the-art performance thanks to global optimizations that specialize code generation to the rendering or optimization task at hand. While Dr.Jit drastically simplifies […]
Feb, 6
SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets
Today’s scientific high-performance computing (HPC) applications and advanced instruments are producing vast volumes of data across a wide range of domains, which places a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in the scientific community, because not only can it significantly reduce the data volumes but […]
Feb, 6
Porting OpenACC to OpenMP on heterogeneous systems
This documentation is designed for beginners in Graphics Processing Unit (GPU) programming who want to become familiar with the OpenACC and OpenMP offloading models. Here we present an overview of these two programming models as well as of GPU architectures. Specifically, we provide some insights into the functionality of these models and perform experiments involving different […]
Feb, 6
GC3: An Optimizing Compiler for GPU Collective Communication
Machine learning models made up of millions or billions of parameters are often trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, the collective communications used in these applications become a bottleneck. Custom collective algorithms optimized for both particular network topologies and application-specific communication patterns can […]
Jan, 30
Performance prediction of deep learning applications training in GPU as a service systems
Data analysts predict that the GPU as a Service (GPUaaS) market will grow from US$700 million in 2019 to US$7 billion in 2025, with a compound annual growth rate of over 38%, to support 3D models, animated video processing, and gaming. GPUaaS adoption will also be boosted by the use of graphics processing units (GPUs) […]
Jan, 30
Teaching Parallel Programming in Containers: Virtualization of a Heterogeneous Local Infrastructure
Providing parallel programming education is an emerging challenge that requires teaching approaches to further the learning process and a complex infrastructure to provide a suitable environment for the practical laboratory classes. Failing to prioritize parallel programming requirements in the education of future computing professionals can lead to a significant training gap, negatively impacting the efficient use of current […]
Jan, 30
Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs
More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, […]
Jan, 30
GenGNN: A Generic FPGA Framework for Graph Neural Network Acceleration
Graph neural networks (GNNs) have recently exploded in popularity thanks to their broad applicability to ubiquitous graph-related problems such as quantum chemistry, drug discovery, and high-energy physics. However, meeting the demand for novel GNN models and fast inference simultaneously is challenging because of the gap between the difficulty in developing efficient FPGA accelerators and the […]
Jan, 30
Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU
In a general graph data structure like an adjacency matrix, when edges are homogeneous, the connectivity of two nodes can be sufficiently represented using a single bit. This insight has, however, not yet been adequately exploited by the existing matrix-centric graph processing frameworks. This work fills the void by systematically exploring the bit-level representation of […]
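The single-bit observation in this excerpt can be illustrated with a minimal sketch. This is not the Bit-GraphBLAS implementation; the helper names `pack_adjacency`, `has_edge`, and `bfs_step` are hypothetical and chosen only for illustration. Each row of the adjacency matrix is packed into one integer, so an edge test is a single bit test and one hop of graph traversal is a bitwise OR over rows.

```python
# Illustrative sketch only: not the Bit-GraphBLAS implementation.
# Each adjacency-matrix row is packed into one Python int, where
# bit j of rows[i] is 1 if there is an edge i -> j.

def pack_adjacency(n, edges):
    """Return a list of n ints; bit dst of rows[src] is set for each edge (src, dst)."""
    rows = [0] * n
    for src, dst in edges:
        rows[src] |= 1 << dst
    return rows

def has_edge(rows, src, dst):
    """Check a single bit to test whether the edge src -> dst exists."""
    return (rows[src] >> dst) & 1 == 1

def bfs_step(rows, frontier):
    """Expand a frontier bitmask by one hop by OR-ing the rows of all frontier nodes."""
    reached = 0
    node = 0
    while frontier:
        if frontier & 1:
            reached |= rows[node]
        frontier >>= 1
        node += 1
    return reached

# A small directed graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
rows = pack_adjacency(4, edges)

one_hop = bfs_step(rows, 1 << 0)   # neighbours of node 0
two_hop = bfs_step(rows, one_hop)  # neighbours of those neighbours
```

Here a frontier is itself a bitmask of node indices, so the whole traversal state of this four-node graph fits in a few machine bits. A GPU framework would pack bits into fixed-width words rather than arbitrary-precision integers.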
Jan, 23
A tool set for random number generation on GPUs in R
We introduce the R package clrng, which leverages the gpuR package and is able to generate random numbers in parallel on a Graphics Processing Unit (GPU) with the clRNG (OpenCL) library. Parallel processing with GPUs can speed up computationally intensive tasks, which, when combined with R, can largely mitigate R’s downsides in terms of […]
Jan, 23
Reusing Auto-Schedules for Efficient DNN Compilation
Auto-scheduling is a process where a search algorithm automatically explores candidate schedules (program transformations) for a given tensor program on a given hardware platform to improve its performance. However, this can be a very time-consuming process, depending on the complexity of the tensor program and the capacity of the target device, with often many thousands […]
Jan, 23
Multi-hetero Acceleration by GPU and FPGA for Astrophysics Simulation on oneAPI Environment
GPU (Graphics Processing Unit) computing is one of the most popular acceleration methods for various high-performance computing applications. For scientific computations based on multi-physical phenomena, however, a single-device solution on a GPU is insufficient when the timescale or degree of parallelism is not readily supported by a GPU-only solution. We have been […]