27448

Posts

Nov, 6

Enabling Data Movement and Computation Pipelining in Deep Learning Compiler

Pipelining between data loading and computation is a critical tensor program optimization for GPUs. Multi-stage pipelining across the multi-level buffer hierarchy of GPU is particularly indispensable on the latest NVIDIA Ampere GPUs to reduce resource idleness and guarantee kernel performance. Currently, people rely on libraries written by experts such as cuBLAS to access the pipelining […]
Nov, 6

Using scheduling entropy amplification in CUDA/OpenMP code to exhibit non-reproducibility issues

Rounding error or cancellation that appears with each floating-point operations, combined with the lack of control over execution order in parallel code leads to numerical issues such as numerical reproducibility. In order to enhance the possibility to discover such numerical issue, in this article we propose a simple solution base on an index interposer and […]
Nov, 6

An Open-source FPGA Library for Data Sorting

Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing because their flexibility enables the building of application-specific computation pipelines and data supply systems. In addition to the flexibility, toolchains for the development of FPGAs in OpenCL have been developed and offered by FPGA vendors that reduce the programming effort required. However, […]
Nov, 6

Apple Silicon Performance in Scientific Computing

With the release of the Apple Silicon System-on-a-Chip processors, and the impressive performance shown in general use by both the M1 and M1 Ultra, the potential use for Apple Silicon processors in scientific computing is explored. Both the M1 and M1 Ultra are compared to current state-of-the-art data-center GPUs, including an NVIDIA V100 with PCIe, […]
Nov, 6

The Open MatSci ML Toolkit: A Flexible Framework for Machine Learning in Materials Science

We present the Open MatSci ML Toolkit: a flexible, self-contained, and scalable Python-based framework to apply deep learning models and methods on scientific data with a specific focus on materials science and the OpenCatalyst Dataset. Our toolkit provides: 1. A scalable machine learning workflow for materials science leveraging PyTorch Lightning, which enables seamless scaling across […]
Oct, 30

A systematic performance study of the parallel programming framework SkePU 3 using HPC-benchmarks

With hardware performance no longer following Moore’s law, software optimization becomes more important. In this paper, we discuss parallel programming, which is one way to optimize software. However, writing parallel code is considered more difficult than writing sequential code. There is often a specific framework to be used to write parallel code for each type […]
Oct, 30

Benchmarking GPU and TPU Performance with Graph Neural Networks

Many artificial intelligence (AI) devices have been developed to accelerate the training and inference of neural networks models. The most common ones are the Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU). They are highly optimized for dense data representations. However, sparse representations such as graphs are prevalent in many domains, including science. It […]
Oct, 30

gSuite: A Flexible and Framework Independent Benchmark Suite for Graph Neural Network Inference on GPUs

As the interest to Graph Neural Networks (GNNs) is growing, the importance of benchmarking and performance characterization studies of GNNs is increasing. So far, we have seen many studies that investigate and present the performance and computational efficiency of GNNs. However, the work done so far has been carried out using a few high-level GNN […]
Oct, 30

Providing performance portable numerics for Intel GPUs

With discrete Intel GPUs entering the high-performance computing landscape, there is an urgent need for production-ready software stacks for these platforms. In this article, we report how we enable the Ginkgo math library to execute on Intel GPUs by developing a kernel backed based on the DPC++ programming environment. We discuss conceptual differences between the […]
Oct, 30

torchode: A Parallel ODE Solver for PyTorch

We introduce an ODE solver for the PyTorch ecosystem that can solve multiple ODEs in parallel independently from each other while achieving significant performance gains. Our implementation tracks each ODE’s progress separately and is carefully optimized for GPUs and compatibility with PyTorch’s JIT compiler. Its design lets researchers easily augment any aspect of the solver […]
Oct, 23

A Ray Tracing Implementation Performance Comparison between the CPU and the GPU

Ray tracing has gained recent popularity due to the advancement of computer hardware capabilities. The algorithm is used as a rendering technique for computer graphics by tracing rays of light to determine the color of a single pixel, thus simulating the physical behavior of light. This study explores the performance differences between the ray tracing […]
Oct, 23

Tausch: A halo exchange library for large heterogeneous computing systems using MPI, OpenCL, and CUDA

Exchanging halo data is a common task in modern scientific computing applications and efficient handling of this operation is critical for the performance of the overall simulation. Tausch is a novel header-only library that provides a simple API for efficiently handling these types of data movements. Tausch supports both simple CPU-only systems, but also more […]

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: