
Posts

May, 2

Performance analysis and optimization of highly diverging algorithms on GPUs

In this thesis, the performance of the IceCube project's photon propagation code (clsim) is optimized. The process of GPU code analysis and performance optimization is described in detail. When run on the same hardware, the new version achieves a speedup of about 3x over the original implementation. Comparing the unmodified code on hardware currently used […]
May, 2

Easy and Efficient Transformer: Scalable Inference Solution For large NLP model

Ultra-large-scale pre-trained models can effectively improve performance on a variety of tasks, but they also bring a heavy computational burden to inference. This paper introduces a series of optimization methods for ultra-large-scale pre-trained models that combine algorithm characteristics with GPU hardware characteristics and, on this basis, proposes an inference engine: Easy and […]
May, 2

tcFFT: Accelerating Half-Precision FFT through Tensor Cores

Fast Fourier Transform (FFT) is an essential tool in scientific and engineering computation. The increasing demand for mixed-precision FFT has made it possible to utilize half-precision floating-point (FP16) arithmetic for faster speed and energy saving. Specializing in lower precision, NVIDIA Tensor Cores can deliver extremely high computation performance. However, the fixed computation pattern makes it […]
Apr, 25

How to Train BERT with an Academic Budget

Apr, 25

Deep Graph Learning for Program Analysis and System Optimization

It has become increasingly challenging for compilers to cope with evolving computer architectures. Manually written compiler heuristics are not sufficient to capture the impact of data- and hardware-related dependencies on performance. However, machine learning offers an opportunity to learn the common patterns in the existing dataset and predict the future […]
Apr, 25

Ripple: Simplified Large-Scale Computation on Heterogeneous Architectures with Polymorphic Data Layout

GPUs are now used for a wide range of problems within HPC. However, making efficient use of the computational power available with multiple GPUs is challenging. The main challenges in achieving good performance are memory layout, affecting memory bandwidth, effective use of the memory spaces with a GPU, inter-GPU communication, and synchronization. We address these […]
Apr, 25

CryptGPU: Fast Privacy-Preserving Machine Learning on the GPU

We introduce CryptGPU, a system for privacy-preserving machine learning that implements all operations on the GPU (graphics processing unit). Just as GPUs played a pivotal role in the success of modern deep learning, they are also essential for realizing scalable privacy-preserving deep learning. In this work, we start by introducing a new interface to losslessly […]
Apr, 25

Performance Analysis and Optimization Opportunities for NVIDIA Automotive GPUs

Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD) bring unprecedented performance requirements for automotive systems. Graphics Processing Unit (GPU) based platforms have been deployed with the aim of meeting these requirements, with the NVIDIA Jetson TX2 and its high-performance successor, the NVIDIA AGX Xavier, as relevant representatives. However, to what extent high-performance GPU configurations are appropriate for […]
Apr, 18

SLATE port to AMD and Intel platforms

SLATE implements GPU-accelerated linear algebra, relying primarily on vendor-provided GPU BLAS for performance, in particular batched BLAS routines. Initially, SLATE was written using NVIDIA’s CUDA and cuBLAS for GPU acceleration. At the time that the SLATE project was started, it was unclear what GPU technologies would exist for other platforms [1]. Since then, AMD has […]
Apr, 18

FANS: FPGA-Accelerated Near-Storage Sorting

Large-scale sorting is an important yet demanding task for data center applications. In addition to powerful processing capability, a high-performance sorting system requires efficient utilization of the available bandwidth at the various levels of the memory hierarchy. Nowadays, with explosively growing data sizes, the frequent data transfers between the host and the storage device are becoming […]
Apr, 18

Under the Hood of SYCL – An Initial Performance Analysis With an Unstructured-mesh CFD Application

As the computing hardware landscape gets more diverse and the complexity of hardware grows, a general-purpose parallel programming model capable of producing (performance) portable code has become highly attractive. Intel's OneAPI suite, which is based on the SYCL standard, aims to fill this gap using a modern C++ API. In this […]
Apr, 18

Efficient Large-Scale Language Model Training on GPU Clusters

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and b) the number of compute operations required […]

* * *


HGPU group © 2010-2021 hgpu.org

All rights belong to the respective authors
