Posts
Jun 9
Diagnosing Performance Bottlenecks in HPC Applications
The software performance optimization process is one of the most challenging aspects of developing highly performant code, because the underlying performance limitations are hard to diagnose. In many cases, identifying performance bottlenecks, such as latency stalls, requires a combination of fidelity and usability that existing tools do not provide: traditional performance models and runtime analysis lack […]
Jun 9
LAMDA: Learning-Assisted Multi-Stage Autotuning for FPGA Design Closure
A primary barrier to rapid hardware specialization with FPGAs stems from the weak guarantees of existing CAD tools on achieving design closure. Current methodologies require extensive manual effort to configure a large set of options across multiple stages of the toolflow in order to achieve high quality-of-results. Due to the size and complexity of the design space […]
Jun 5
ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data
Parsing is essential for a wide range of use cases, such as stream processing, bulk loading, and in-situ querying of raw data. Yet this compute-intensive step often constitutes a major bottleneck in the data ingestion pipeline, since inputs that require more involved parsing rules are challenging to parallelise. This work proposes a massively […]
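To make the parallelisation challenge concrete, here is a minimal sketch (not ParPaRaw itself; all names are illustrative) of the chunk-alignment idea that parallel parsers build on: split the raw buffer into roughly equal chunks, shift each chunk boundary to the next newline so every worker parses whole records, and parse the chunks independently. Quoted fields that may themselves contain delimiters or newlines break this naive alignment, which is exactly what makes the general problem hard.

```python
# Sketch of naive parallel parsing of delimiter-separated data.
# Assumes records are newline-terminated and fields are comma-separated,
# with no quoted fields (the hard case the naive approach cannot handle).
from concurrent.futures import ProcessPoolExecutor

def chunk_offsets(buf: bytes, n_chunks: int):
    """Yield (start, end) pairs aligned to record (newline) boundaries."""
    size = len(buf)
    step = max(1, size // n_chunks)
    start = 0
    while start < size:
        end = min(start + step, size)
        # Advance end to the next newline so no record is split in two.
        while end < size and buf[end - 1:end] != b"\n":
            end += 1
        yield start, end
        start = end

def parse_chunk(args):
    buf, start, end = args
    rows = []
    for line in buf[start:end].split(b"\n"):
        if line:
            rows.append(line.split(b","))
    return rows

def parallel_parse(buf: bytes, workers: int = 4):
    jobs = [(buf, s, e) for s, e in chunk_offsets(buf, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        out = []
        for rows in pool.map(parse_chunk, jobs):
            out.extend(rows)
    return out

if __name__ == "__main__":
    data = b"a,1\nb,2\nc,3\nd,4\n"
    print(parallel_parse(data, workers=2))  # [[b'a', b'1'], [b'b', b'2'], ...]
```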
Jun 5
Dynamic Distribution Pruning for Efficient Network Architecture Search
Network architectures obtained by Neural Architecture Search (NAS) have shown state-of-the-art performance in various computer vision tasks. Despite the exciting progress, the computational complexity of the forward-backward propagation and the search process makes it difficult to apply NAS in practice. In particular, most previous methods require thousands of GPU days for the search process to […]
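The cost problem comes from evaluating every candidate architecture in full. A toy sketch of the general distribution-pruning idea (my assumptions, not the paper's exact algorithm): keep a distribution over candidate operations, sample and score candidates, update the distribution, and permanently drop candidates whose probability collapses, so later search steps never pay for them again.

```python
# Toy distribution-based pruning for architecture search.
# proxy_reward() is a stand-in for the validation accuracy of a
# sampled architecture; the numbers are fabricated for illustration.
import math, random

OPS = ["conv3x3", "conv5x5", "maxpool", "skip"]

def softmax(logits):
    m = max(logits.values())
    exp = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

def proxy_reward(op):
    base = {"conv3x3": 0.9, "conv5x5": 0.7, "maxpool": 0.5, "skip": 0.4}
    return base[op] + random.gauss(0.0, 0.05)

logits = {op: 0.0 for op in OPS}
lr, baseline, prune_below = 0.5, 0.6, 0.05

for step in range(300):
    probs = softmax(logits)
    op = random.choices(list(probs), weights=list(probs.values()))[0]
    # REINFORCE-style update: raise the logit of ops that beat the baseline.
    logits[op] += lr * (proxy_reward(op) - baseline)
    # Dynamic pruning: drop candidates whose probability has collapsed.
    probs = softmax(logits)
    for k in [k for k in logits if probs[k] < prune_below]:
        del logits[k]

print("surviving ops:", sorted(logits))
```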
Jun 5
Raising the Performance of the Tinker-HP Molecular Modeling Package on Intel’s HPC Architectures: a Living Review [Article v1.0]
This living paper reviews the present High Performance Computing (HPC) capabilities of the Tinker-HP molecular modeling package. We focus here on the reference, double precision, massively parallel molecular dynamics engine present in Tinker-HP, dedicated to performing large-scale simulations. We show how it can be adapted to recent Intel Central Processing Unit (CPU) petascale […]
Jun 2
Classify QCD phase transition with deep learning
The state-of-the-art pattern recognition method in machine learning (the deep convolutional neural network) is used to identify the equation of state (EoS) employed in the relativistic hydrodynamic simulations of heavy ion collisions. High-level correlations of particle spectra in transverse momentum and azimuthal angle learned by the network act as an effective EoS-meter in deciphering the nature […]
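A minimal PyTorch sketch of the setup the abstract describes (layer sizes and the input binning are my assumptions): a small CNN reads a particle spectrum binned in transverse momentum × azimuthal angle as a one-channel image and classifies which EoS (e.g. crossover vs. first-order) produced it.

```python
# Hedged sketch of an "EoS-meter" CNN; architecture details are illustrative.
import torch
import torch.nn as nn

class EoSMeter(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # spectrum as 1-channel image
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Fake batch: 8 events, 24 p_T bins x 32 azimuthal-angle bins.
spectra = torch.randn(8, 1, 24, 32)
labels = torch.randint(0, 2, (8,))
model = EoSMeter()
loss = nn.functional.cross_entropy(model(spectra), labels)
loss.backward()
print("loss:", loss.item())
```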
Jun 2
The Accelerator Wall: Limits of Chip Specialization
Specializing chips using hardware accelerators has become the prime means of bridging the gap between growing computational demands and the stagnating transistor budgets caused by the slowdown of CMOS scaling. Much of the benefit of chip specialization stems from optimizing a computational problem within a given chip’s transistor budget. Unfortunately, the stagnation of the […]
Jun 2
A Development Platform for Embedded Domain-Specific Languages
The use of domain-specific languages (DSLs) is a promising approach to helping programmers write efficient programs for high-performance computing. Programmers would find it difficult to write such programs by hand with only the low-level abstractions, such as arrays and loops, provided by a general-purpose language. This chapter presents our new implementation technique for domain-specific […]
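As a flavour of the embedded-DSL technique in general (this is a generic sketch, not the chapter's platform), a host language's operator overloading can build an expression tree instead of computing values immediately, so a backend can later analyze the tree and emit optimized low-level code in place of hand-written arrays and loops:

```python
# Tiny embedded DSL: Python expressions build a tree; emit() is a
# trivial code generator standing in for a real optimizing backend.
class Expr:
    def __add__(self, other): return BinOp("+", self, wrap(other))
    def __mul__(self, other): return BinOp("*", self, wrap(other))

class Var(Expr):
    def __init__(self, name): self.name = name
    def emit(self): return self.name

class Const(Expr):
    def __init__(self, value): self.value = value
    def emit(self): return repr(self.value)

class BinOp(Expr):
    def __init__(self, op, lhs, rhs): self.op, self.lhs, self.rhs = op, lhs, rhs
    def emit(self): return f"({self.lhs.emit()} {self.op} {self.rhs.emit()})"

def wrap(x):
    return x if isinstance(x, Expr) else Const(x)

# Ordinary-looking Python, but evaluating it only builds a tree.
a, b = Var("a"), Var("b")
expr = a * b + 2
print(expr.emit())  # ((a * b) + 2)
```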
Jun 2
Heterogeneous Resource-Elastic Scheduling for CPU+FPGA Architectures
Heterogeneous computing is a key strategy for meeting the requirements of many compute-intensive applications. However, CPU+FPGA platforms are currently underutilized, as scheduling is often constrained to a run-to-completion model or to accelerating a single application at a time. To address this, the paper proposes heterogeneous resource-elastic scheduling for maximizing the utilization of both CPU […]
Jun 2
Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models
We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective […]
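A minimal NumPy sketch of the two-force update the abstract describes (the learning rate, pull strength, and quadratic toy loss are my stand-ins, not the paper's exact algorithm): each worker takes a regular gradient step plus a corrective pull toward the currently best-performing worker, where EASGD would instead pull toward the parameter average.

```python
# Toy "pull-to-leader" distributed SGD on a quadratic objective.
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])

def loss(x): return 0.5 * np.sum((x - target) ** 2)
def grad(x): return x - target

workers = [rng.normal(size=2) for _ in range(4)]
eta, pull = 0.1, 0.2

for step in range(100):
    # The leader is the worker with the lowest current objective value.
    leader = min(workers, key=loss).copy()
    for i, x in enumerate(workers):
        g = grad(x) + rng.normal(scale=0.1, size=2)     # noisy gradient
        workers[i] = x - eta * g - pull * (x - leader)  # gradient step + pull to leader

print("best loss:", min(loss(x) for x in workers))
```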
May 30
Breadth-First Search using Dynamic Parallelism on the GPU
Breadth-First Search is an important basis for many different graph-based algorithms with applications ranging from peer-to-peer networking to garbage collection. However, the performance of different approaches depends strongly on the type of graph. In this paper, three algorithms of varying complexity are implemented using the CUDA Programming Model for the GPU and are compared to […]
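For reference, the level-synchronous structure that GPU BFS implementations parallelise looks like the sketch below (plain Python, not the paper's CUDA code); every node in a level's frontier can be expanded independently, and with CUDA dynamic parallelism a kernel can launch child kernels to expand large frontiers without returning to the host.

```python
# Level-synchronous BFS; the inner frontier-expansion loop is the
# work a GPU implementation distributes across threads.
def bfs_levels(adj, source):
    """adj: dict mapping node -> list of neighbours. Returns node -> depth."""
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for u in frontier:          # independent per-node work
            for v in adj[u]:
                if v not in depth:
                    depth[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return depth

adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(bfs_levels(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```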
May 30
The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study
Over the past few years, great progress has been made in improving the computing power of general-purpose graphics processing units (GPGPUs), which has fueled the success of deep neural networks (DNNs) in fields like computer vision and natural language processing. A typical DNN training process repeatedly updates tens of millions of parameters, which not only requires […]
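A minimal measurement sketch of how such an energy study can be instrumented (assuming the `pynvml` NVML bindings are installed; `train_one_epoch` in the usage comment is hypothetical): sample GPU power draw in a background thread and integrate it over time to estimate the energy a workload consumed. A DVFS study would additionally sweep core/memory frequencies and repeat the measurement at each setting.

```python
# Estimate the energy (joules) a GPU workload consumes by sampling
# power draw via NVML and multiplying average power by elapsed time.
import time, threading
import pynvml

def measure_energy(workload, interval=0.05, device_index=0):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            samples.append(watts)
            time.sleep(interval)

    t = threading.Thread(target=sampler)
    t.start()
    start = time.time()
    workload()                      # e.g. one training epoch
    elapsed = time.time() - start
    stop.set()
    t.join()
    pynvml.nvmlShutdown()
    avg_watts = sum(samples) / max(1, len(samples))
    return avg_watts * elapsed      # joules

# Usage (train_one_epoch is a hypothetical training function):
# energy_j = measure_energy(lambda: train_one_epoch(model, loader))
```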