
Posts

Jun, 16

SYCL Code Generation for Multigrid Methods

Multigrid methods are fast and scalable numerical solvers for partial differential equations (PDEs) that possess a large design space for implementing their algorithmic components. Code generation approaches allow formulating multigrid methods on a higher level of abstraction that can then be used to define a problem- and hardware-specific solution. Since these problems have considerable implementation […]
Jun, 16

Software Compilation Techniques for Heterogeneous Embedded Multi-Core Systems

The increasing demands of modern embedded systems, such as high performance and energy efficiency, have motivated the use of heterogeneous multicore platforms enabled by Multiprocessor System-on-Chips (MPSoCs). To fully exploit the power of these platforms, new tools are needed to address the increasing software complexity and achieve high productivity. An MPSoC compiler is a toolchain to tackle […]
Jun, 16

Performance Analysis and Automatic Tuning of Hash Aggregation on GPUs

Hash aggregation is an important data processing primitive which can be significantly accelerated by modern graphics processors (GPUs). Previous work derived heuristics for GPU-accelerated hash aggregation from the study of a particular GPU. In this paper, we examine the influence of different execution parameters on GPU-accelerated hash aggregation on four NVIDIA and two AMD GPUs […]
Jun, 12

Tensor Processing Units for Financial Monte Carlo

Monte Carlo methods are core to many routines in quantitative finance such as derivatives pricing, hedging and risk metrics. Unfortunately, Monte Carlo methods are very computationally expensive when it comes to running simulations in high-dimensional state spaces where they are still a method of choice in the financial industry. Recently, Tensor Processing Units (TPUs) have […]
Jun, 12

Performance Modelling of Deep Learning on Intel Many Integrated Core Architectures

Many complex problems, such as natural language processing or visual object detection, are solved using deep learning. However, efficient training of complex deep convolutional neural networks for large data sets is computationally demanding and requires parallel computing resources. In this paper, we present two parameterized performance models for estimation of execution time of training convolutional […]
Jun, 12

Parallel scalable simulations of biological neural networks using TensorFlow: A beginner’s guide

Neuronal networks are often modeled as systems of coupled, nonlinear, ordinary or partial differential equations. The number of differential equations used to model a network increases with the size of the network and the level of detail used to model individual neurons and synapses. As one scales up the size of the simulation it becomes […]
Jun, 9

Temporospatial Epidemic Simulations Using Heterogeneous Computing

Discrete Event Simulation (DES) is widely used for analysis of complex temporospatial epidemic models. In such simulations, a conspicuous fraction (50%-90%) of simulation runtime is typically spent in solving equations used to model epidemic progression. General Purpose Graphics Processing Units (GPGPUs) hold considerable potential to reduce time for solving epidemic equations. However, the significant differences […]
Jun, 9

A Survey on Evaluating and Optimizing Performance of Intel Xeon Phi

Intel’s Xeon Phi combines the parallel processing power of a many-core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors […]
Jun, 9

PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion

OpenCL offers code portability but no performance portability. Given an OpenCL program X specifically written for one platform P, existing OpenCL compilers, which usually optimize its host and kernel codes individually, often yield poor performance for another platform Q. Instead of obtaining a performance-improved version of X for Q via manual tuning, we aim to […]
Jun, 9

Diagnosing Performance Bottlenecks in HPC Applications

The software performance optimization process is one of the most challenging aspects of developing highly performant code because underlying performance limitations are hard to diagnose. In many cases, identifying performance bottlenecks, such as latency stalls, requires a combination of fidelity and usability that existing tools do not provide: traditional performance models and runtime analysis lack […]
Jun, 9

LAMDA: Learning-Assisted Multi-Stage Autotuning for FPGA Design Closure

A primary barrier to rapid hardware specialization with FPGAs stems from weak guarantees of existing CAD tools on achieving design closure. Current methodologies require extensive manual efforts to configure a large set of options across multiple stages of the toolflow, intended to achieve high quality-of-results. Due to the size and complexity of the design space […]
Jun, 5

ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data

Parsing is essential for a wide range of use cases, such as stream processing, bulk loading, and in-situ querying of raw data. Yet, the compute-intense step often constitutes a major bottleneck in the data ingestion pipeline, since parsing of inputs that require more involved parsing rules is challenging to parallelise. This work proposes a massively […]

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
