
Posts

Jun 9

A Survey on Evaluating and Optimizing Performance of Intel Xeon Phi

Intel’s Xeon Phi combines the parallel processing power of a many-core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors […]
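
Context for the "programming ease" claim: unlike GPU accelerators, Xeon Phi runs standard CPU programming models such as OpenMP, so existing parallel loops port with little change. A minimal sketch of the kind of many-core loop such studies benchmark (illustrative only, not taken from the survey):

    // Minimal OpenMP loop; Xeon Phi exposes its ~60+ cores through the
    // same shared-memory model as an ordinary CPU (illustrative sketch).
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

        #pragma omp parallel for  // no Phi-specific code required
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];

        std::printf("threads: %d, c[0] = %f\n", omp_get_max_threads(), c[0]);
        return 0;
    }
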
Jun 9

PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion

OpenCL offers code portability but no performance portability. Given an OpenCL program X specifically written for one platform P, existing OpenCL compilers, which usually optimize its host and kernel codes individually, often yield poor performance for another platform Q. Instead of obtaining a performance-improved version of X for Q via manual tuning, we aim to […]
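
For readers unfamiliar with the split the abstract refers to: an OpenCL program pairs host code (device discovery, buffer management, kernel launch) with kernel code compiled at run time, and conventional compilers optimize the two separately. A minimal, self-contained sketch of that structure follows (error handling omitted; it illustrates what PPOpenCL fuses, not PPOpenCL's technique):

    // Host + kernel split in plain OpenCL (error checks omitted for brevity).
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    // Kernel code: a string compiled at run time for the chosen device.
    static const char* kSrc = R"(
    __kernel void scale(__global float* x, float s) {
        size_t i = get_global_id(0);
        x[i] *= s;
    })";

    int main() {
        const size_t n = 1024;
        std::vector<float> data(n, 1.0f);

        // Host code: set up the device, build the kernel, move data, launch.
        cl_platform_id plat; clGetPlatformIDs(1, &plat, nullptr);
        cl_device_id dev;    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, nullptr);
        cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
        clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
        cl_kernel k = clCreateKernel(prog, "scale", nullptr);

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(float), data.data(), nullptr);
        float s = 2.0f;
        clSetKernelArg(k, 0, sizeof(buf), &buf);
        clSetKernelArg(k, 1, sizeof(s), &s);
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), data.data(),
                            0, nullptr, nullptr);
        std::printf("data[0] = %f\n", data[0]);  // 2.0
        return 0;
    }
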
Jun 9

Diagnosing Performance Bottlenecks in HPC Applications

The software performance optimization process is one of the most challenging aspects of developing highly performant code, because the underlying performance limitations are hard to diagnose. In many cases, identifying performance bottlenecks, such as latency stalls, requires a combination of fidelity and usability that existing tools do not provide: traditional performance models and runtime analysis lack […]
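
A concrete example of a latency stall: in a dependent pointer chase, each load must wait for the previous one, so the core idles for a full memory round-trip per iteration, yet the source looks no different from a throughput-bound loop. A minimal microbenchmark sketch:

    // Pointer chase: every load depends on the previous one, so each
    // iteration pays full memory latency. Such stalls are invisible in
    // source code, which is what makes them hard to diagnose.
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const size_t n = 1 << 24;  // far larger than the last-level cache
        std::vector<size_t> next(n);
        std::iota(next.begin(), next.end(), size_t{0});

        // Sattolo's algorithm: a single-cycle permutation, so the chase
        // touches all n slots in a cache-hostile order.
        std::mt19937_64 rng{42};
        for (size_t k = n - 1; k > 0; --k)
            std::swap(next[k], next[rng() % k]);

        auto t0 = std::chrono::steady_clock::now();
        size_t i = 0;
        for (size_t step = 0; step < n; ++step)
            i = next[i];  // serialized dependent loads: one stall each
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
        std::printf("%.1f ns per dependent load (i = %zu)\n", ns, i);
        return 0;
    }
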
Jun 9

LAMDA: Learning-Assisted Multi-Stage Autotuning for FPGA Design Closure

A primary barrier to rapid hardware specialization with FPGAs stems from the weak guarantees of existing CAD tools on achieving design closure. Current methodologies require extensive manual effort to configure a large set of options across multiple stages of the toolflow in order to achieve high quality-of-results. Due to the size and complexity of the design space […]
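
To make the search problem concrete, the sketch below does blind random search over a hypothetical two-knob option space with an invented stand-in for the quality-of-results measurement; LAMDA's point is to replace exactly this kind of blind search with learned, multi-stage guidance:

    // Random-search autotuner over invented knobs; each real evaluation
    // would be a full synthesis/place/route run taking hours.
    #include <cstdio>
    #include <map>
    #include <random>
    #include <string>

    using Config = std::map<std::string, int>;

    // Hypothetical stand-in for measuring quality-of-results (e.g., the
    // achieved clock period in ns) of one toolflow configuration.
    double evaluate_qor(const Config& c) {
        return 10.0 - 0.5 * c.at("placer_effort") + 0.3 * (c.at("seed") % 3);
    }

    int main() {
        std::mt19937 rng{1};
        std::uniform_int_distribution<int> effort(0, 3), seed(0, 9);

        Config best;
        double best_qor = 1e9;
        for (int trial = 0; trial < 20; ++trial) {  // exhaustive search infeasible
            Config c{{"placer_effort", effort(rng)}, {"seed", seed(rng)}};
            double q = evaluate_qor(c);
            if (q < best_qor) { best_qor = q; best = c; }
        }
        std::printf("best period %.2f ns (effort=%d, seed=%d)\n",
                    best_qor, best.at("placer_effort"), best.at("seed"));
        return 0;
    }
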
Jun 5

ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data

Parsing is essential for a wide range of use cases, such as stream processing, bulk loading, and in-situ querying of raw data. Yet, this compute-intensive step often constitutes a major bottleneck in the data ingestion pipeline, since parsing inputs that require more involved rules is challenging to parallelise. This work proposes a massively […]
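
The standard building block for parallelising delimiter splitting is a two-pass scheme: chunks count their delimiters independently, a prefix sum assigns each chunk a slot range, and a second pass writes field boundaries in parallel. A minimal OpenMP sketch for the trivial rule set (no quoting or escaping; fields governed by more involved rules that span chunk boundaries are exactly what makes the general problem hard):

    // Two-pass parallel delimiter splitting: count, prefix-sum, write.
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <string>
    #include <vector>

    int main() {
        const std::string data = "a,bb,ccc,dddd,e,ff,ggg,h";
        const char delim = ',';
        const int chunks = 4;
        const size_t len = data.size(), per = (len + chunks - 1) / chunks;

        // Pass 1: each chunk counts its delimiters independently.
        std::vector<size_t> count(chunks + 1, 0);
        #pragma omp parallel for
        for (int c = 0; c < chunks; ++c)
            for (size_t i = c * per; i < std::min(len, (c + 1) * per); ++i)
                if (data[i] == delim) ++count[c + 1];

        // Prefix sum: count[c] becomes chunk c's first output slot.
        std::partial_sum(count.begin(), count.end(), count.begin());

        // Pass 2: write delimiter positions into disjoint slot ranges.
        std::vector<size_t> pos(count[chunks]);
        #pragma omp parallel for
        for (int c = 0; c < chunks; ++c) {
            size_t out = count[c];
            for (size_t i = c * per; i < std::min(len, (c + 1) * per); ++i)
                if (data[i] == delim) pos[out++] = i;
        }

        for (size_t p : pos) std::printf("%zu ", p);  // field boundaries
        std::printf("\n");
        return 0;
    }
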
Jun 5

Dynamic Distribution Pruning for Efficient Network Architecture Search

Network architectures obtained by Neural Architecture Search (NAS) have shown state-of-the-art performance in various computer vision tasks. Despite the exciting progress, the computational complexity of the forward-backward propagation and the search process makes it difficult to apply NAS in practice. In particular, most previous methods require thousands of GPU days for the search process to […]
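
The general pattern behind distribution-based search, in miniature: keep a probability per candidate operation, sample an architecture, score it, reinforce what worked, and prune candidates whose probability collapses so later steps search a smaller space. The scoring and update rule below are invented placeholders, not the paper's algorithm:

    // Distribution-guided search with pruning (placeholder reward/update).
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        std::vector<double> prob{0.25, 0.25, 0.25, 0.25};  // 4 candidate ops
        std::mt19937 rng{7};

        for (int step = 0; step < 200; ++step) {
            std::discrete_distribution<int> pick(prob.begin(), prob.end());
            int op = pick(rng);

            // Stand-in for training/evaluating the sampled architecture.
            double reward = (op == 2) ? 1.0 : 0.1;
            prob[op] *= 1.0 + 0.05 * reward;  // reinforce the sampled op

            double sum = 0;
            for (double p : prob) sum += p;
            for (double& p : prob) p /= sum;             // renormalize
            for (double& p : prob) if (p < 0.02) p = 0;  // prune unlikely ops
        }
        for (size_t i = 0; i < prob.size(); ++i)
            std::printf("op %zu: %.3f\n", i, prob[i]);
        return 0;
    }
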
Jun 5

Raising the Performance of the Tinker-HP Molecular Modeling Package on Intel’s HPC Architectures: a Living Review [Article v1.0]

This living paper reviews the present High Performance Computing (HPC) capabilities of the Tinker-HP molecular modeling package. We focus here on the reference, double-precision, massively parallel molecular dynamics engine in Tinker-HP that is dedicated to performing large-scale simulations. We show how it can be adapted to recent Intel Central Processing Unit (CPU) petascale […]
Jun 2

Classify QCD phase transition with deep learning

The state-of-the-art pattern recognition method in machine learning, the deep convolutional neural network, is used to identify the equation of state (EoS) employed in relativistic hydrodynamic simulations of heavy-ion collisions. High-level correlations of particle spectra in transverse momentum and azimuthal angle learned by the network act as an effective EoS-meter in deciphering the nature […]
Jun 2

The Accelerator Wall: Limits of Chip Specialization

Specializing chips using hardware accelerators has become the prime means of alleviating the gap between growing computational demands and the stagnating transistor budgets caused by the slowdown of CMOS scaling. Much of the benefit of chip specialization stems from optimizing a computational problem within a given chip’s transistor budget. Unfortunately, the stagnation of the […]
Jun 2

A Development Platform for Embedded Domain-Specific Languages

The use of domain-specific languages (DSLs) is a promising approach to helping programmers write efficient programs for high-performance computing. Programmers would find it difficult to write such a program by hand with only the low-level abstractions, such as arrays and loops, provided by a general-purpose language. This chapter presents our new implementation technique for domain-specific […]
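
A classic instance of this technique in C++ is expression templates: x = a + b + c reads like ordinary arithmetic, but the embedded DSL intercepts the expression and generates one fused loop with no temporaries. A minimal sketch (a generic illustration, not the chapter's implementation):

    // Embedded DSL via expression templates: `a + b + c` builds a lazy
    // expression tree; assignment evaluates it in a single fused loop.
    #include <cstdio>
    #include <vector>

    template <class L, class R>
    struct Add {                  // expression node, evaluated per element
        const L& l; const R& r;
        double operator[](size_t i) const { return l[i] + r[i]; }
    };

    struct Vec {
        std::vector<double> v;
        explicit Vec(size_t n, double x = 0) : v(n, x) {}
        double operator[](size_t i) const { return v[i]; }
        template <class E>
        Vec& operator=(const E& e) {  // one loop, no temporary vectors
            for (size_t i = 0; i < v.size(); ++i) v[i] = e[i];
            return *this;
        }
    };

    template <class L, class R>
    Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }

    int main() {
        Vec a(4, 1.0), b(4, 2.0), c(4, 3.0), x(4);
        x = a + b + c;                     // DSL expression, fused at evaluation
        std::printf("x[0] = %f\n", x[0]);  // 6.0
        return 0;
    }
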
Jun 2

Heterogeneous Resource-Elastic Scheduling for CPU+FPGA Architectures

Heterogeneous computing is a key strategy for meeting the requirements of many compute-intensive applications. Currently, however, CPU+FPGA platforms are commonly underutilized, as scheduling is often constrained to a run-to-completion model or to accelerating a single application at a time. To address this, this paper proposes heterogeneous resource-elastic scheduling for maximizing the utilization of both CPU […]
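
The utilization argument in miniature: under run-to-completion on a single resource, the other resource idles; letting both drain a shared task queue keeps them busy simultaneously. A toy sketch with the FPGA emulated by a thread (timings invented):

    // Two heterogeneous workers draining one queue instead of one resource
    // running to completion while the other idles. FPGA is emulated.
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<int> tasks;
    std::mutex m;

    bool pop(int& t) {
        std::lock_guard<std::mutex> lk(m);
        if (tasks.empty()) return false;
        t = tasks.front(); tasks.pop(); return true;
    }

    void worker(const char* name, int cost_ms) {
        int t;
        while (pop(t)) {  // each resource pulls work as soon as it is free
            std::this_thread::sleep_for(std::chrono::milliseconds(cost_ms));
            std::printf("task %d done on %s\n", t, name);
        }
    }

    int main() {
        for (int i = 0; i < 8; ++i) tasks.push(i);
        std::thread cpu(worker, "CPU", 30), fpga(worker, "FPGA", 10);
        cpu.join(); fpga.join();
        return 0;
    }
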
Jun 2

Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models

We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective […]
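
From that description, a plausible concrete form of each worker's update, with notation assumed here rather than taken from the paper (x_i^t: worker i's parameters, \tilde{x}^t: parameters of the currently best-performing worker, \eta: step size, \lambda: coupling strength):

    % hedged sketch of the two-force update; symbols are assumptions
    x_i^{t+1} = x_i^t - \eta \, \nabla f_i(x_i^t) - \lambda \, (x_i^t - \tilde{x}^t)
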

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
