
Posts

Jan, 30

Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs

More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, […]
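
The teaser doesn't show SZ's actual decoding scheme, but the baseline such work optimizes over can be pictured as a chunk-parallel Huffman tree walk. Below is a minimal CUDA sketch under that assumption; the chunk_bit_off offsets, the flat left/right/symbol tree arrays, and the fixed symbols-per-chunk count are illustrative choices, not the paper's format.

    // Minimal sketch (not SZ's actual scheme): each thread decodes its own
    // chunk of the bitstream by walking a Huffman tree stored as flat arrays.
    // Per-chunk bit offsets are assumed to be precomputed on the host.
    __global__ void huffman_decode_chunks(const unsigned int *bits,
                                          const unsigned long long *chunk_bit_off,
                                          const int *left, const int *right, // child indices
                                          const int *symbol,                 // >=0 at leaves, -1 inside
                                          unsigned char *out,
                                          const unsigned long long *out_off,
                                          int symbols_per_chunk, int num_chunks)
    {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= num_chunks) return;

        unsigned long long pos = chunk_bit_off[c];
        unsigned long long o   = out_off[c];
        for (int s = 0; s < symbols_per_chunk; ++s) {
            int node = 0;                       // start at the root
            while (symbol[node] < 0) {          // walk until a leaf is reached
                unsigned int word = bits[pos >> 5];
                unsigned int bit  = (word >> (31 - (pos & 31))) & 1u;
                node = bit ? right[node] : left[node];
                ++pos;
            }
            out[o + s] = (unsigned char)symbol[node];
        }
    }

The serial bit-by-bit walk inside each chunk is exactly what makes Huffman decoding hard to parallelize, which is why per-chunk entry points have to be known in advance.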
Jan, 30

GenGNN: A Generic FPGA Framework for Graph Neural Network Acceleration

Graph neural networks (GNNs) have recently exploded in popularity thanks to their broad applicability to ubiquitous graph-related problems such as quantum chemistry, drug discovery, and high energy physics. However, simultaneously meeting the demand for novel GNN models and fast inference is challenging because of the gap between the difficulty of developing efficient FPGA accelerators and the […]
Jan, 30

Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU

In a general graph data structure like an adjacency matrix, when edges are homogeneous, the connectivity of two nodes can be sufficiently represented using a single bit. This insight has, however, not yet been adequately exploited by the existing matrix-centric graph processing frameworks. This work fills the void by systematically exploring the bit-level representation of […]
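
The one-bit-per-edge idea can be sketched quickly. In the hypothetical layout below (not the paper's actual Bit-GraphBLAS format), each adjacency-matrix row is packed into 32-bit words, so an edge test is a single bit probe and a row degree reduces to a few popcounts:

    // Bit-packed adjacency: n rows, each words_per_row 32-bit words wide.
    __global__ void row_degrees(const unsigned int *adj_bits,
                                int *degree, int n, int words_per_row)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;
        int d = 0;
        for (int w = 0; w < words_per_row; ++w)
            d += __popc(adj_bits[row * words_per_row + w]); // count set bits
        degree[row] = d;
    }

    // Edge test: is bit j of row i set?
    __device__ bool has_edge(const unsigned int *adj_bits, int words_per_row,
                             int i, int j)
    {
        return (adj_bits[i * words_per_row + (j >> 5)] >> (j & 31)) & 1u;
    }

Compared with a byte- or word-per-entry matrix, this cuts memory traffic by up to 8-32x, which is the kind of saving a bit-level framework can then exploit systematically.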
Jan, 23

A tool set for random number generation on GPUs in R

We introduce the R package clrng, which leverages the gpuR package and is able to generate random numbers in parallel on a Graphics Processing Unit (GPU) with the clRNG (OpenCL) library. Parallel processing with GPUs can speed up computationally intensive tasks and, when combined with R, can largely mitigate R's downsides in terms of […]
Jan, 23

Reusing Auto-Schedules for Efficient DNN Compilation

Auto-scheduling is a process where a search algorithm automatically explores candidate schedules (program transformations) for a given tensor program on a given hardware platform to improve its performance. However, this can be a very time-consuming process, depending on the complexity of the tensor program and the capacity of the target device, with often many thousands […]
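
In miniature, auto-scheduling looks like the loop below: enumerate candidate configurations, time each on the device, and keep the best. This CUDA sketch searches only block sizes for a single saxpy kernel, a tiny slice of the schedule spaces real auto-schedulers explore (tiling, fusion, memory layouts); a real tuner would also warm up and average repeated runs.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void saxpy(float a, const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 24;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));

        int candidates[] = {64, 128, 256, 512, 1024};
        int best_block = 0; float best_ms = 1e30f;

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        for (int block : candidates) {          // exhaustive search over configs
            int grid = (n + block - 1) / block;
            cudaEventRecord(t0);
            saxpy<<<grid, block>>>(2.0f, x, y, n);
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);
            float ms; cudaEventElapsedTime(&ms, t0, t1);
            if (ms < best_ms) { best_ms = ms; best_block = block; }
        }
        printf("best block size: %d (%.3f ms)\n", best_block, best_ms);
        cudaFree(x); cudaFree(y);
        return 0;
    }

Even this toy search pays one kernel execution per candidate; with thousands of candidates per operator, the cost of searching from scratch is what makes reusing auto-schedules attractive.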
Jan, 23

Multi-hetero Acceleration by GPU and FPGA for Astrophysics Simulation on oneAPI Environment

GPU (Graphics Processing Unit) computing is one of the most popular acceleration methods for various high-performance computing applications. For scientific computations based on multi-physical phenomena, however, a single-device GPU solution is insufficient, because a single timescale or degree of parallelism is not well supported by a GPU-only approach. We have been […]
Jan, 23

NNP/MM: Fast molecular dynamics simulations with machine learning potentials and molecular mechanics

Parametric and non-parametric machine learning potentials have emerged recently as a way to improve the accuracy of biomolecular simulations. Here, we present NNP/MM, a hybrid method integrating neural network potentials (NNPs) and molecular mechanics (MM). It allows simulating part of a molecular system with an NNP, while the rest is simulated with MM for efficiency. […]
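
Schematically, the hybrid scheme is a partition of atoms plus a combination of forces. The sketch below is only a shape: the NNPModel/MMModel interfaces are hypothetical placeholders, not the actual NNP/MM API, and a real implementation couples the two regions more carefully than a plain sum.

    #include <vector>

    struct Vec3 { double x = 0, y = 0, z = 0; };

    // Hypothetical model interfaces, for illustration only.
    struct NNPModel {
        void add_forces(const std::vector<int> &atoms, std::vector<Vec3> &f) {
            /* neural network potential inference would go here */
        }
    };
    struct MMModel {
        void add_forces(const std::vector<int> &atoms, std::vector<Vec3> &f) {
            /* classical force-field evaluation would go here */
        }
    };

    // One force evaluation: the small region of interest (e.g., a ligand)
    // gets the accurate NNP; the surrounding solvent/protein gets cheap MM.
    void hybrid_forces(NNPModel &nnp, MMModel &mm,
                       const std::vector<int> &nnp_atoms,
                       const std::vector<int> &mm_atoms,
                       std::vector<Vec3> &forces)
    {
        nnp.add_forces(nnp_atoms, forces);
        mm.add_forces(mm_atoms, forces);
    }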
Jan, 23

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore […]
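
The active/idle split described above is straightforward to compute from a kernel trace. A minimal sketch, assuming a trace format of [start, end) spans sorted by start time with non-overlapping kernels (so active time is exactly the sum of kernel runtimes, as in the abstract):

    #include <vector>

    struct KernelSpan { double start_ms, end_ms; };  // assumed trace format

    // Device time over an iteration = active time (sum of kernel runtimes)
    // + idle time (gaps between kernels).
    void device_time_breakdown(const std::vector<KernelSpan> &trace,
                               double &active_ms, double &idle_ms)
    {
        active_ms = 0.0;
        for (const KernelSpan &k : trace)
            active_ms += k.end_ms - k.start_ms;
        double span = trace.back().end_ms - trace.front().start_ms;
        idle_ms = span - active_ms;
    }

For a model with many small embedding-lookup kernels, the idle term (launch overhead, host-side gaps) can rival the active term, which is why both components matter.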
Jan, 16

Dopia: Online Parallelism Management for Integrated CPU/GPU Architectures

Recent desktop and mobile processors often integrate CPU and GPU onto the same die. The limited memory bandwidth of these integrated architectures can negatively affect the performance of data-parallel workloads when all computational resources are active. The combination of active CPU and GPU cores achieving the maximum performance depends on a workload’s characteristics, making manual […]
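
The tuning problem can be seen in a toy co-execution sketch: give a fraction alpha of the work to the GPU and leave the rest to the CPU. On an integrated chip both halves contend for the same memory bandwidth, so the best alpha, and whether to enable all cores at all, varies per workload; that is what an online manager has to discover. The splitting scheme below is illustrative, not Dopia's actual mechanism.

    #include <cuda_runtime.h>

    __global__ void square_gpu(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * data[i];   // some data-parallel work
    }

    // Process the first alpha*n elements on the GPU while the CPU handles
    // the tail; `data` is assumed to live in memory visible to both sides
    // (e.g., managed memory on an integrated architecture).
    void co_execute(float *data, int n, float alpha) {
        int n_gpu = (int)(alpha * n);
        if (n_gpu > 0)
            square_gpu<<<(n_gpu + 255) / 256, 256>>>(data, n_gpu);
        for (int i = n_gpu; i < n; ++i)           // CPU tail runs concurrently
            data[i] = data[i] * data[i];
        cudaDeviceSynchronize();
    }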
Jan, 16

Fancier: A Unified Framework for Java, C, and OpenCL Integration

Graphics Processing Units (GPUs) have evolved from very specialized designs geared towards computer graphics to accommodate general-purpose, highly parallel workloads. Harnessing the performance that these accelerators provide requires the use of specialized native programming interfaces, such as CUDA or OpenCL, or higher-level programming models like OpenMP or OpenACC. However, in managed programming languages, offloading execution into […]
Jan, 16

Research and Development of Porting SYCL on QNX Operating System for High Parallelism

As a standard C++ programming model, SYCL has gained popularity by incorporating various parallel computing frameworks. With the development of hardware technologies, low-level computing devices are becoming increasingly varied, resulting in great hardware heterogeneity. Although many computing frameworks, such as OpenCL, OpenMP and CUDA, can benefit heterogeneous computing, they increase […]
Jan, 16

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs

Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior works have shown that dynamic parallelism may impose a high performance penalty when a large number of small […]
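
For reference, the pattern in question looks like the sketch below: a generic nested-parallelism example (compiled with -rdc=true), not the paper's benchmark. When count is usually small and num_nodes is large, the many tiny child launches are exactly where the overhead comes from, and a compiler framework can, for example, aggregate them.

    // Each parent thread owns a node with an irregular number of children
    // and launches a child grid sized to that count.
    __global__ void child(const int *items, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { /* process items[i] */ }
    }

    __global__ void parent(const int *offsets, const int *items, int num_nodes) {
        int node = blockIdx.x * blockDim.x + threadIdx.x;
        if (node >= num_nodes) return;
        int begin = offsets[node], count = offsets[node + 1] - begin;
        if (count > 0)                          // nested, irregular parallelism
            child<<<(count + 255) / 256, 256>>>(items + begin, count);
    }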

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
