Posts
Nov, 12
Best Practice Guide – GPGPU
Graphics Processing Units (GPUs) were originally developed for computer gaming and other graphical tasks, but for many years have been exploited for general purpose computing across a number of areas. They offer advantages over traditional CPUs because they have greater computational capability, and use high-bandwidth memory systems (where memory bandwidth is the main bottleneck for […]
Nov, 12
Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II
The High Performance Computing (HPC) community recognizes energy consumption as a major problem. Extensive research is underway to identify means to increase energy efficiency of HPC systems including consideration of alternative building blocks for future systems. This thesis considers one such system, the Texas Instruments Keystone II, a heterogeneous Low-Power System-on-Chip (LPSoC) processor that combines […]
Nov, 12
Performance Evaluation of Deep Learning Tools in Docker Containers
With the success of deep learning techniques in a broad range of application domains, many deep learning software frameworks have been developed and are updated frequently to adapt to new hardware features and software libraries, which poses a major challenge for end users and system administrators. To address this problem, container techniques are widely […]
Nov, 7
Scalable Streaming Tools for Analyzing N-body Simulations: Finding Halos and Investigating Excursion Sets in One Pass
Cosmological N-body simulations play a vital role in studying how the Universe evolves. To compare with observations and make scientific inferences, statistical analysis of large simulation datasets, e.g., finding halos and obtaining multi-point correlation functions, is crucial. However, traditional in-memory methods for these tasks do not scale to the datasets that are prohibitively large in modern […]
Nov, 7
Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs
Efficiently exploiting GPUs is increasingly essential in scientific computing, as many current and upcoming supercomputers are built using them. To facilitate this, there are a number of programming approaches, such as CUDA, OpenACC and OpenMP 4, supporting different programming languages (mainly C/C++ and Fortran). There are also several compiler suites (clang, nvcc, PGI, XL) each […]
Nov, 7
Radeon PRO Solid State Graphics (SSG) API User Manual
The Radeon Pro SSG software library enables peer-to-peer (P2P) data transfers between the GPU and Radeon on-board SSD devices. It provides a way to read OS file data from SSDs into OpenCL, OpenGL and DirectX buffers with very low-latency P2P communication. The development kit version of this library supports only the Microsoft Windows 10 operating […]
Nov, 7
Lattice QCD on new chips: a community summary
I review the most recent developments in QCD codes on new architectures, with a focus on the performance obtained by the different coding strategies presented during the Lattice-2017 conference.
Nov, 7
Acceleration of tensor-product operations for high-order finite element methods
This paper is devoted to GPU kernel optimization and performance analysis of three tensor-product operators arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close-to-the-peak performance for these operators requires extensive optimization because of the operators’ properties: low arithmetic intensity, tiered structure, and the need to store […]
Nov, 5
Dynamic Load Balancing Strategies for Graph Applications on GPUs
Acceleration of graph applications on GPUs has attracted considerable interest due to the ubiquitous use of graph processing in various domains. The inherent irregularity of graph applications leads to several challenges for parallelization. A key challenge, which we address in this paper, is load imbalance. If the work assignment to threads uses node-based graph partitioning, […]
Nov, 5
A Dynamic Hash Table for the GPU
We design and implement a fully concurrent dynamic hash table for GPUs with performance comparable to state-of-the-art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional per-thread (or per-warp) work assignment and processing. By using this […]
Nov, 5
Data Coherence Analysis and Optimization for Heterogeneous Computing
Although heterogeneous computing has enabled impressive program speed-ups, knowledge about the architecture of the target device is still critical to reap full hardware benefits. Programming such architectures is complex and is usually done by means of specialized languages (e.g. CUDA, OpenCL). The cost of moving host/device data and keeping it coherent may easily eliminate any performance […]
Nov, 5
ChainerMN: Scalable Distributed Deep Learning Framework
One of the keys to deep learning's breakthroughs in various fields has been the use of high computing power centered on GPUs. Harnessing further computing power through distributed processing is essential not only to make deep learning bigger and faster but also to tackle unsolved challenges. We present the […]