Posts
Aug, 30
CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM
The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there does not currently exist an efficient, scalable solution for CUDA applications on NVIDIA GPUs. CRAC (Checkpoint-Restart Architecture for CUDA) is new checkpoint-restart solution for fault tolerance that […]
Aug, 30
8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks
Performance optimization can be a daunting task especially as the hardware architecture becomes more and more complex. This paper takes a kernel from the Materials Science code BerkeleyGW, and demonstrates a few performance analysis and optimization techniques. Despite challenges such as high register usage, low occupancy, complex data access patterns, and the existence of several […]
Aug, 27
The 18th International Conference on High Performance Computing & Simulation (HPCS), 2020
The 2020 International Conference on High Performance Computing & Simulation (HPCS 2020) will be held on December 10-14, 2020 in Barcelona, Spain (virtually). Under the theme of “HPC and Modeling & Simulation for the 21st Century,” HPCS 2020 will focus on a wide range of the state-of-the-art as well as emerging topics pertaining to high […]
Aug, 23
Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs
Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is because DLA is both critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the […]
Aug, 23
Compiler-Based Tools to Aid in Data Transfer Optimization and On-Chip Debug of Heterogeneous Compute Systems
First, we present techniques to efficiently schedule data transfers through compiler analyses. Compared to transferring data immediately before and after the kernel executes, our scheduling results in orders of magnitude improvements in execution time, number of data transfers, and number of bytes transferred. Second, we demonstrate techniques to provide on-chip debugging for heterogeneous systems through […]
Aug, 23
Modular FPGA Systems with Support for Dynamic Workloads and Virtualisation
This thesis shows that it is feasible to build modular FPGA systems which can dynamically change the hardware resources in the spatial and the temporal domains using existing tools and accelerators, to improve maintainability, adaptability, and accessibility for FPGA systems. To achieve this, first, a modular FPGA development flow is proposed to build an FPGA […]
Aug, 23
AIPerf: Automated machine learning as an AI-HPC benchmark
The plethora of complex artificial intelligence (AI) algorithms and available high performance computing (HPC) power stimulates the convergence of AI and HPC. The expeditious development of AI components, in both hardware and software domain, increases the system heterogeneity, which prompts the challenge on fair and comprehensive benchmarking. Existing HPC and AI benchmarks fail to cover […]
Aug, 23
Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse Linear Algebra Computations
GPU accelerators have become an important backbone for scientific high performance computing, and the performance advances obtained from adopting new GPU hardware are significant. In this paper we take a first look at NVIDIA’s newest server line GPU, the A100 architecture part of the Ampere generation. Specifically, we assess its performance for sparse linear algebra […]
Aug, 9
Heterogeneous parallel computing for image registration and linear algebra applications
This doctoral thesis focuses on GPU acceleration of medical image registration and sparse general matrix-matrix multiplication (SpGEMM). The comprehensive work presented here aims to enable new possibilities in Image Guided Surgery (IGS). IGS provides the surgeon with advanced navigation tools during surgery. Image registration, which is a part of IGS, is computationally demanding, therefore GPU […]
Aug, 9
Ignite-GPU: a GPU-enabled in-memory computing architecture on clusters
During recent years, big data explosion and the increase in main memory capacity, on the one hand, and the need for faster data processing, on the other hand, have caused the development of various in-memory processing tools to manage and analyze data. Engaging the speed of the main memory and advantaging data locality, these tools […]
Aug, 9
Parallel acceleration of CPU and GPU range queries over large data sets
Data management systems commonly use bitmap indices to increase the efficiency of querying scientific data. Bitmaps are usually highly compressible and can be queried directly using fast hardware-supported bitwise logical operations. The processing of bitmap queries is inherently parallel in structure, which suggests they could benefit from concurrent computer systems. In particular, bitmap-range queries offer […]
Aug, 9
PERCH 2.0: Fast and Accurate GPU-based Perception via Search for Object Pose Estimation
Pose estimation of known objects is fundamental to tasks such as robotic grasping and manipulation. The need for reliable grasping imposes stringent accuracy requirements on pose estimation in cluttered, occluded scenes in dynamic environments. Modern methods employ large sets of training data to learn features in order to find correspondence between 3D models and observed […]