Posts
Sep, 26
IgNet. A Super-precise Convolutional Neural Network
Convolutional neural networks (CNN) are known to be an effective means to detect and analyze images. Their power is essentially based on the ability to extract out images common features. There exist, however, images involving unique, irregular features or details. Such is a collection of unusual children drawings reflecting the kids imagination and individuality. These […]
Sep, 26
An Experimental Study of SYCL Task Graph Parallelism for Large-Scale Machine Learning Workloads
Task graph parallelism has emerged as an important tool to efficiently execute large machine learning workloads on GPUs. Users describe a GPU workload in a task dependency graph rather than aggregated GPU operations and dependencies, allowing the runtime to run whole-graph scheduling optimization to significantly improve the performance. While the new CUDA graph execution model […]
Sep, 26
CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research
Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What […]
Sep, 19
A readahead prefetcher for GPU file system layer
GPUs are broadly used in I/O-intensive big data applications. Prior works demonstrate the benefits of using GPU-side file system layer, GPUfs, to improve the GPU performance and programmability in such workloads. However, GPUfs fails to provide high performance for a common I/O pattern where a GPU is used to process a whole data set sequentially. […]
Sep, 19
Achieving near native runtime performance and cross-platform performance portability for random number generation through SYCL interoperability
High-performance computing (HPC) is a major driver accelerating scientific research and discovery, from quantum simulations to medical therapeutics. The growing number of new HPC systems coming online are being furnished with various hardware components, engineered by competing industry entities, each having their own architectures and platforms to be supported. While the increasing availability of these […]
Sep, 19
Measurement and Analysis of GPU-accelerated Applications with HPCToolkit
To address the challenge of performance analysis on the US DOE’s forthcoming exascale supercomputers, Rice University has been extending its HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications. To help developers understand the performance of accelerated applications as a whole, HPCToolkit’s measurement and analysis tools attribute metrics to calling contexts that span […]
Sep, 19
GPU Algorithms for Efficient Exascale Discretizations
In this paper we describe the research and development activities in the Center for Efficient Exascale Discretization within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek […]
Sep, 19
A Study of Mixed Precision Strategies for GMRES on GPUs
Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for mixed precision strategies […]
Sep, 5
Supporting CUDA for an extended RISC-V GPU architecture
With the rapid development of scientific computation, more and more researchers and developers are committed to implementing various workloads/operations on different devices. Among all these devices, NVIDIA GPU is the most popular choice due to its comprehensive documentation and excellent development tools. As a result, there are abundant resources for hand-writing high-performance CUDA codes. However, […]
Sep, 5
Data-Oriented Language Implementation of Lattice-Boltzmann Method for Dense and Sparse Geometries
The performance of lattice-Boltzmann solver implementations usually depends mainly on memory access patterns. Achieving high performance requires then complex code which handles careful data placement and ordering of memory transactions. In this work, we analyse the performance of an implementation based on a new approach called the data-oriented language, which allows the combining of complex […]
Sep, 5
WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU
Deep reinforcement learning (RL) is a powerful framework to train decision-making models in complex dynamical environments. However, RL can be slow as it learns through repeated interaction with a simulation of the environment. Accelerating RL requires both algorithmic and engineering innovations. In particular, there are key systems engineering bottlenecks when using RL in complex environments […]
Sep, 5
LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs
Exploiting data locality in GPGPUs is critical for efficiently using the smaller data caches and handling the memory bottleneck problem. This paper proposes a thread block-centric locality analysis, which identifies the locality among the thread blocks (TBs) in terms of a number of common data references. In LocalityGuru, we seek to employ a detailed just-in-time […]