high performance computing on graphics processing units: hgpu.org

Posts

Jan, 17

Instruments of Productivity for High Performance Computing

High performance computing (HPC) is now well established as the cornerstone for building and conducting software simulations in numerous scientific and industrial fields. However, the hardware complexity of supercomputers is steadily increasing to deliver ever-improving computing performance, which in turn increases the complexity of HPC application development. As a result, the need for […]

Jan, 17

Implementation of Autoencoders with Systolic Arrays through OpenCL

In the world of algorithm acceleration and the implementation of deep neural networks’ recall phase, OpenCL-based solutions have a clear tendency to produce perfectly adapted kernels for graphics processing unit (GPU) architectures. However, they fail to obtain the same results when applied to field-programmable gate array (FPGA) based architectures. This situation, along with an […]

OpenCL

Jan, 17

CFD code adaptation to the FPGA architecture

In recent years, we have observed the intensive development of accelerated computing platforms. Although current trends indicate a well-established position of GPU devices in the HPC environment, the FPGA (Field-Programmable Gate Array) aspires to be an alternative solution for offloading CPU computation. This paper presents a systematic adaptation of four different CFD (Computational Fluid Dynamics) […]

OpenCL

Jan, 17

Explainable Deep Behavioral Sequence Clustering for Transaction Fraud Detection

In the e-commerce industry, user behavior sequence data has been widely used in many business units, such as search and merchandising, to improve their products. However, it is rarely used in financial services, not only because of its 3V characteristics (Volume, Velocity and Variety) but also because of its unstructured nature. In this […]

Jan, 17

Fast convolutional neural networks on FPGAs with hls4ml

We introduce an automated tool for deploying ultra-low-latency, low-power deep neural networks with large convolutional layers on FPGAs. By extending the hls4ml library, we demonstrate how to achieve inference latency of 5 μs using convolutional architectures, while preserving state-of-the-art model performance. Considering benchmark models trained on the Street View House Numbers Dataset, we demonstrate various […]

Jan, 10

linus: Conveniently explore, share, and present large-scale biological trajectory data from a web browser

In biology, we are often confronted with information-rich, large-scale trajectory data, but exploring and communicating patterns in such data is often a cumbersome task. Ideally, the data should be wrapped with an interactive visualisation in one concise package that makes it straightforward to create and test hypotheses collaboratively. To address these challenges, we have developed […]

OpenCL

Jan, 10

Advances in Electron Microscopy with Deep Learning

This doctoral thesis covers some of my advances in electron microscopy with deep learning. Highlights include a comprehensive review of deep learning in electron microscopy; large new electron microscopy datasets for machine learning, dataset search engines based on variational autoencoders, and automatic data clustering by t-distributed stochastic neighbour embedding; adaptive learning rate clipping to stabilize […]

Jan, 10

Efficient Nearest-Neighbor Data Sharing in GPUs

Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing […]

CUDA
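
As a rough illustration of the nearest-neighbor pattern described in the abstract above, the following minimal sketch applies one sweep of a 5-point averaging stencil on a small grid. It is plain Python rather than GPU code, and the function name `jacobi_step`, the averaging rule, and the fixed-boundary handling are assumptions chosen for illustration; they are not taken from the paper.

```python
def jacobi_step(grid):
    """Apply one sweep of a 5-point averaging stencil (illustrative only).

    Each interior point becomes the average of its own value and the
    values of its four nearest neighbors. Boundary points are copied
    unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so reads use the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = (
                grid[i][j]
                + grid[i - 1][j]
                + grid[i + 1][j]
                + grid[i][j - 1]
                + grid[i][j + 1]
            ) / 5.0
    return new


# A 5x5 grid with a single non-zero value in the center.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 5.0
result = jacobi_step(grid)
```

After one sweep, the center value has spread to its four direct neighbors, while diagonal neighbors and boundary points are still zero. On a GPU, each output point would typically be assigned to its own thread, which is why neighboring threads end up needing each other's data.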

Jan, 10

Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs

To apply neural sequence models such as Transformers to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite, pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to […]

Jan, 10

Hardware Acceleration of HPC Computational Flow Dynamics using HBM-enabled FPGAs

Scientific computing is at the core of many High-Performance Computing applications, including computational flow dynamics. Because of the utmost importance of simulating increasingly large computational models, hardware acceleration is receiving increased attention due to its potential to maximize the performance of scientific computing. A Field-Programmable Gate Array is a reconfigurable hardware accelerator that is fully […]

CUDA • OpenCL

Jan, 6

9th International Workshop on OpenCL and SYCL, 2021

IWOCL & SYCLcon is the annual gathering of the international community of OpenCL and SYCL developers, researchers, suppliers and Khronos Working Group members to share best practice, and to advance the use and evolution of the Open Computing Language (OpenCL) and the SYCL standard for C++ programming of heterogeneous platforms and their associated ecosystems. This […]

Jan, 3

Design, Implementation and Test of Efficient GPU to GPU Communication Methods

Stencil codes are commonly used to solve many problems. On parallel heterogeneous systems with CPUs and GPUs, the domain is usually split and assigned to GPUs, where it is further divided into GPU blocks. The iterative distributed stencil computation consists of two steps, computation and communication, where the subdomains exchange boundary data, also called […]
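
The two-step pattern described above (exchange boundary data, then compute) can be sketched in one dimension. This is a minimal plain-Python illustration with two subdomains standing in for two GPUs; the helper names (`stencil_1d`, `split_with_halos`, `exchange_halos`, `distributed_step`) and the 3-point averaging rule are hypothetical and are not the communication methods evaluated in the paper.

```python
def stencil_1d(u):
    """3-point average on interior points; endpoints are copied unchanged."""
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def split_with_halos(u, mid):
    """Split u at index mid; each part gets one ghost cell on the shared edge."""
    left = u[:mid] + [0.0]   # trailing ghost cell, filled by the exchange
    right = [0.0] + u[mid:]  # leading ghost cell, filled by the exchange
    return left, right


def exchange_halos(left, right):
    """Communication step: copy each neighbor's edge value into the ghost cell."""
    left[-1] = right[1]   # first real value of the right subdomain
    right[0] = left[-2]   # last real value of the left subdomain


def distributed_step(u, mid):
    """One iteration: exchange boundary data, compute locally, then merge."""
    left, right = split_with_halos(u, mid)
    exchange_halos(left, right)
    new_left = stencil_1d(left)
    new_right = stencil_1d(right)
    # Drop the ghost cells before stitching the subdomains back together.
    return new_left[:-1] + new_right[1:]


u = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
single = stencil_1d(u)
split = distributed_step(u, 3)
```

With the ghost cells filled before the compute step, the split run produces the same values as the single-domain run. In a real multi-GPU setting, the assignments in `exchange_halos` would be replaced by device-to-device transfers.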

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
