Posts
Nov, 1
Designing a Modern Skeleton Programming Framework for Parallel and Heterogeneous Systems
Today’s society is increasingly software-driven and dependent on powerful computer technology. Therefore, it is important that advancements in low-level processor hardware are made available for exploitation by a growing number of programmers of differing skill levels. However, as we are approaching the end of Moore’s law, hardware designers are finding new and increasingly complex […]
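Skeleton programming separates what a computation does from how it is executed: the user supplies only per-element logic, and the framework owns the parallel backend. As a minimal sketch of that idea in plain C++17 (not the framework described in the paper; the class name Map and the use of std::execution::par are illustrative assumptions):

```cpp
// A minimal "map" skeleton in portable C++17. The user supplies only the
// per-element function; the skeleton owns the parallelization strategy
// (here simply std::execution::par).
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

template <typename F>
class Map {
    F f_;
public:
    explicit Map(F f) : f_(f) {}
    template <typename T>
    std::vector<T> operator()(const std::vector<T>& in) const {
        std::vector<T> out(in.size());
        std::transform(std::execution::par, in.begin(), in.end(),
                       out.begin(), f_);
        return out;
    }
};

int main() {
    Map square([](float x) { return x * x; });  // user code: what to compute
    std::vector<float> v{1.f, 2.f, 3.f, 4.f};
    auto r = square(v);                         // skeleton: how it runs
    std::printf("%f\n", r.back());              // prints 16.000000
}
```

The same user code can then be retargeted by swapping the execution policy, or a GPU backend, inside the skeleton, which is the portability argument skeleton frameworks make.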
Nov, 1
Towards Co-execution on Commodity Heterogeneous Systems: Optimizations for Time-Constrained Scenarios
Heterogeneous systems are now present everywhere, from powerful supercomputers to desktop computers and mobile devices, thanks to their excellent performance and energy efficiency. The ubiquity of these architectures in both desktop systems and medium-sized servers provides enough variability to tackle a wide range of problems, such as multimedia workloads, video encoding, image filtering and inference in […]
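Co-execution means splitting a single data-parallel kernel across the CPU and the GPU so both work simultaneously. A hedged sketch of a static split using standard OpenMP (the 80/20 ratio and the function name saxpy_coexec are illustrative; a co-execution runtime would presumably choose the split adaptively):

```cpp
// Static co-execution of one SAXPY loop: the first part of the range is
// offloaded to the GPU as an asynchronous OpenMP target task, while the
// host cores process the tail; taskwait joins the GPU work at the end.
void saxpy_coexec(float a, float* x, const float* y, int n) {
    const int split = static_cast<int>(n * 0.8f);  // assumed GPU share

    // GPU part, deferred via nowait so the host can keep going.
    #pragma omp target teams distribute parallel for nowait \
        map(tofrom: x[0:split]) map(to: y[0:split])
    for (int i = 0; i < split; ++i) x[i] = a * x[i] + y[i];

    // CPU part runs concurrently on the host cores.
    #pragma omp parallel for
    for (int i = split; i < n; ++i) x[i] = a * x[i] + y[i];

    #pragma omp taskwait  // wait for the GPU task before returning
}
```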
Nov, 1
Out-of-core Training for Extremely Large-Scale Neural Networks With Adaptive Window-Based Scheduling
While large neural networks demonstrate higher performance in various tasks, training large networks is difficult due to limitations on GPU memory size. We propose a novel out-of-core algorithm that enables faster training of extremely large-scale neural networks with sizes larger than the allotted GPU memory. Under a given memory budget constraint, our scheduling algorithm locally adapts […]
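The underlying idea of out-of-core training is to keep only a window of tensors resident on the GPU and spill the rest to host memory, prefetching them back before they are needed again. A toy planning sketch under that reading (this is not the paper's adaptive window algorithm; the greedy oldest-first eviction and the struct names are assumptions):

```cpp
// Toy sketch of budget-driven offloading (not the paper's algorithm).
// Walk the layers in execution order; tensors that no longer fit in the
// GPU budget are marked for transfer to host memory, to be prefetched
// back before their gradients are needed.
#include <cstdio>
#include <deque>
#include <vector>

struct Tensor { int layer; std::size_t bytes; };

std::vector<int> plan_offloads(const std::vector<Tensor>& acts,
                               std::size_t budget_bytes) {
    std::deque<Tensor> resident;   // the in-GPU "window"
    std::size_t used = 0;
    std::vector<int> offloaded;    // layers whose activations move to host
    for (const Tensor& t : acts) {
        used += t.bytes;
        resident.push_back(t);
        while (used > budget_bytes && resident.size() > 1) {
            offloaded.push_back(resident.front().layer);  // evict oldest
            used -= resident.front().bytes;
            resident.pop_front();
        }
    }
    return offloaded;
}

int main() {
    std::vector<Tensor> acts{{0, 400}, {1, 300}, {2, 500}, {3, 600}};
    for (int l : plan_offloads(acts, 1000))
        std::printf("offload activations of layer %d\n", l);
}
```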
Nov, 1
Not Half Bad: Exploring Half-Precision in Graph Convolutional Neural Networks
With the growing significance of graphs as an effective representation of data in numerous applications, efficient graph analysis using modern machine learning is receiving increasing attention. Deep learning approaches often operate over the entire adjacency matrix, as the input and intermediate network layers are all designed in proportion to the size […]
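For context, the computational core of a graph convolutional layer is H' = ReLU(A·H·W), where A is the (normalized) adjacency matrix; the matrix products are where a move to half precision would pay off. A dense fp32 sketch (standard C++ has no portable 16-bit float before C++23, so the fp16 candidates are only marked in comments):

```cpp
// One dense GCN layer, H' = ReLU(A * H * W), in plain C++ (fp32).
// Running these products in half precision is what the paper studies;
// this sketch stays in float and only marks where fp16 would go.
#include <vector>
#include <algorithm>

using Mat = std::vector<std::vector<float>>;

Mat matmul(const Mat& a, const Mat& b) {
    Mat c(a.size(), std::vector<float>(b[0].size(), 0.f));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t k = 0; k < b.size(); ++k)        // fp16 candidate:
            for (std::size_t j = 0; j < b[0].size(); ++j) // these FMAs dominate
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

Mat gcn_layer(const Mat& A, const Mat& H, const Mat& W) {
    Mat Z = matmul(matmul(A, H), W);  // aggregate neighbors, then transform
    for (auto& row : Z)
        for (float& v : row) v = std::max(v, 0.f);  // ReLU
    return Z;
}
```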
Nov, 1
Memory Optimization for Deep Networks
Deep learning is slowly, but steadily, hitting a memory bottleneck. While the tensor computation in top-of-the-line GPUs increased by 32x over the last five years, the total available memory only grew by 2.5x. This prevents researchers from exploring larger architectures, as training large networks requires more memory for storing intermediate outputs. In this paper, we […]
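A standard remedy the abstract alludes to is trading compute for memory: store only some intermediate outputs and recompute the rest during the backward pass. A toy sketch of this classic checkpointing idea (not necessarily the paper's specific method; tanh stands in for an arbitrary layer):

```cpp
// Checkpointing a chain of n layers: keep only every k-th activation
// during the forward pass and recompute the rest on the backward pass,
// trading roughly one extra forward sweep for O(n/k + k) stored
// activations instead of O(n).
#include <cmath>
#include <map>

float layer(float x, int /*i*/) { return std::tanh(x); }  // stand-in layer

float forward_with_checkpoints(float x, int n, int k,
                               std::map<int, float>& ckpt) {
    for (int i = 0; i < n; ++i) {
        if (i % k == 0) ckpt[i] = x;  // store a checkpoint, drop the rest
        x = layer(x, i);
    }
    return x;
}

// Recover the activation entering layer i by replaying from the nearest
// stored checkpoint (what a checkpointed backward pass does on demand).
float recompute_input(int i, int k, const std::map<int, float>& ckpt) {
    const int base = (i / k) * k;
    float x = ckpt.at(base);
    for (int j = base; j < i; ++j) x = layer(x, j);
    return x;
}
```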
Oct, 25
OpenCL Performance on the Intel Heterogeneous Architecture Research Platform
The fundamental operation of matrix multiplication is ubiquitous across a myriad of disciplines. Yet, the identification of new optimizations for matrix multiplication remains relevant for emerging hardware architectures and heterogeneous systems. Frameworks such as OpenCL enable computation orchestration on existing systems, and OpenCL’s availability through the Intel High Level Synthesis compiler allows users to architect […]
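The paper's OpenCL/HLS kernels are not shown in this excerpt; the locality idea most matrix-multiplication optimizations build on can be sketched in plain C++ as cache blocking (the tile size T = 32 is an assumption to be tuned per target):

```cpp
// Generic tiled matrix multiply in plain C++. Blocking by T keeps a
// TxT working set hot in fast memory, the same locality idea that
// OpenCL and HLS variants elaborate on for their targets.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t T = 32;  // tile size: an assumption, tune per target

void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += T)
        for (std::size_t kk = 0; kk < n; kk += T)
            for (std::size_t jj = 0; jj < n; jj += T)
                // multiply one TxT tile pair into the C tile
                for (std::size_t i = ii; i < std::min(ii + T, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + T, n); ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + T, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```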
Oct, 25
Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs
Heterogeneous systems are becoming increasingly prevalent. In order to exploit the rich compute resources of such systems, robust programming models are needed for application developers to seamlessly migrate legacy code from today’s systems to tomorrow’s. Over the past decade and more, directives have been established as one of the promising paths to tackle programmatic challenges […]
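The directive-based pattern such compiler comparisons benchmark looks like the following standard OpenMP 4.5+ offload construct (compiler flags for targeting a V100 vary by toolchain, e.g. nvc++ -mp=gpu or clang++ with -fopenmp-targets=nvptx64-nvidia-cuda):

```cpp
// Canonical OpenMP offload of a vector addition: map the inputs to the
// device, distribute the loop across teams of threads, and map the
// result back. This is the baseline construct compiler studies measure.
void vec_add(const float* a, const float* b, float* c, int n) {
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```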
Oct, 25
Mixed-Precision Embedding Using a Cache
In recommendation systems, practitioners observed that an increase in the number of embedding tables and their sizes often leads to significant improvement in model performance. Given this and the business importance of these models to major internet companies, embedding tables for personalization tasks have grown to terabyte scale and continue to grow at a significant rate. […]
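One way to read the title: keep the backing table in a compact low-precision format and serve hot rows from a small full-precision cache. A hedged sketch of that structure (the int8-with-scale format, the class names, and the absent eviction policy are all illustrative assumptions, not the paper's design):

```cpp
// Sketch of a mixed-precision embedding lookup: the full table lives in
// low precision (int8 plus a per-row scale here), while a small cache
// holds frequently accessed rows in fp32. Eviction and training-time
// updates are omitted.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct QuantizedTable {
    std::vector<int8_t> data;   // rows * dim values, quantized
    std::vector<float> scale;   // one dequantization scale per row
    int dim;
    std::vector<float> dequantize(std::size_t row) const {
        std::vector<float> out(dim);
        for (int j = 0; j < dim; ++j)
            out[j] = data[row * dim + j] * scale[row];
        return out;
    }
};

struct CachedEmbedding {
    QuantizedTable table;
    std::unordered_map<std::size_t, std::vector<float>> cache;  // hot fp32 rows
    const std::vector<float>& lookup(std::size_t row) {
        auto it = cache.find(row);
        if (it == cache.end())  // miss: dequantize and promote to the cache
            it = cache.emplace(row, table.dequantize(row)).first;
        return it->second;
    }
};
```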
Oct, 25
Cross-platform programming model for many-core lattice Boltzmann simulations
We present a novel, hardware-agnostic implementation strategy for lattice Boltzmann (LB) simulations, which yields massive performance on homogeneous and heterogeneous many-core platforms. Based solely on C++17 Parallel Algorithms, our approach does not rely on any language extensions, external libraries, vendor-specific code annotations, or pre-compilation steps. Thanks in particular to a recently proposed GPU back-end to […]
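The programming model the abstract describes, C++17 Parallel Algorithms with no language extensions, looks like the following collision-step sketch (not the paper's code; the BGK-style relaxation is the textbook LB collision rule, and nvc++ -stdpar=gpu is one toolchain that offloads such loops to GPUs):

```cpp
// Per-cell LB collision expressed purely with C++17 Parallel Algorithms:
// std::for_each over an index range with a parallel execution policy.
// The same source can run on CPU threads or, with an offloading
// toolchain, on a GPU, with no vendor annotations in the code.
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

void relax_step(std::vector<double>& f, const std::vector<double>& feq,
                double omega) {
    std::vector<std::size_t> idx(f.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::for_each(std::execution::par_unseq, idx.begin(), idx.end(),
                  [&](std::size_t i) {
                      // BGK-style relaxation toward local equilibrium
                      f[i] += omega * (feq[i] - f[i]);
                  });
}
```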
Oct, 25
FlowPM: Distributed TensorFlow Implementation of the FastPM Cosmological N-body Solver
We present FlowPM, a Particle-Mesh (PM) cosmological N-body code implemented in Mesh-TensorFlow for GPU-accelerated, distributed, and differentiable simulations. We implement and validate the accuracy of a novel multi-grid scheme based on multiresolution pyramids to compute large scale forces efficiently on distributed platforms. We explore the scaling of the simulation on large-scale supercomputers and compare it […]
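For orientation, every particle-mesh solver starts by depositing particle mass onto a grid with cloud-in-cell (CIC) weights before the FFT-based force solve. A 1D sketch of CIC (FlowPM does this in 3D inside Mesh-TensorFlow; this standalone version is only illustrative):

```cpp
// Cloud-in-cell mass deposition in 1D with periodic boundaries: each
// particle shares its mass linearly between the two nearest grid
// points. This is the first step of every particle-mesh force solve.
#include <cmath>
#include <vector>

void cic_deposit(const std::vector<double>& pos,  // positions in [0, n)
                 std::vector<double>& rho) {      // density grid, size n
    const std::size_t n = rho.size();
    for (double x : pos) {
        const std::size_t i = static_cast<std::size_t>(std::floor(x));
        const double frac = x - static_cast<double>(i);
        rho[i % n]       += 1.0 - frac;  // weight to the left grid point
        rho[(i + 1) % n] += frac;        // remainder to the right one
    }
}
```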
Oct, 18
When HLS Meets FPGA HBM: Benchmarking and Bandwidth Optimization
With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bounded applications to benefit from FPGA acceleration. However, we found that it is not easy to fully utilize the available bandwidth when developing some applications with high-level synthesis (HLS) tools. This is […]
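In Vitis-style HLS, reaching multiple HBM pseudo-channels concurrently means giving each top-level pointer its own AXI master bundle. A hedged sketch (pragma spellings follow Xilinx/Vitis HLS; the bundle-to-channel assignment happens at link time via a connectivity configuration that is omitted here):

```cpp
// HLS pattern for using two HBM pseudo-channels at once: separate AXI
// master bundles let the read and write streams move through different
// channels concurrently; PIPELINE targets one element per clock cycle.
extern "C" void stream_copy(const float* in, float* out, int n) {
    #pragma HLS INTERFACE m_axi port=in  bundle=gmem0  // e.g. one HBM channel
    #pragma HLS INTERFACE m_axi port=out bundle=gmem1  // a different channel
    for (int i = 0; i < n; ++i) {
        #pragma HLS PIPELINE II=1  // initiation interval of one cycle
        out[i] = in[i];
    }
}
```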
Oct, 18
Portable high-order finite element kernels I: Streaming Operations
This paper is devoted to the development of highly efficient kernels performing vector operations relevant to linear system solvers. In particular, we focus on the low arithmetic intensity operations (i.e., streaming operations) performed within the conjugate gradient iterative method, using the parameters specified in the CEED benchmark problems for high-order hexahedral finite elements. We propose […]
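Concretely, the streaming part of a conjugate gradient iteration consists of axpy-style vector updates and a dot product; the usual optimization is fusing them so each vector is read from memory once. A plain C++ sketch of such a fused kernel (the particular fusion shown is illustrative, not necessarily the paper's):

```cpp
// Fused CG streaming kernel: the solution update, the residual update,
// and the new residual norm are computed in a single pass, so r and Ap
// are each read once instead of twice. Memory traffic, not arithmetic,
// bounds these low-intensity operations.
#include <vector>

double fused_update_dot(std::vector<double>& x, std::vector<double>& r,
                        const std::vector<double>& p,
                        const std::vector<double>& Ap, double alpha) {
    double rr = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += alpha * p[i];   // x <- x + alpha p
        r[i] -= alpha * Ap[i];  // r <- r - alpha Ap
        rr   += r[i] * r[i];    // accumulate ||r||^2 in the same pass
    }
    return rr;
}
```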