high performance computing on graphics processing units: hgpu.org

Posts

Apr, 20

Real world applications of Artificial Intelligence on constrained hardware

These days the field of Artificial Intelligence (and its many subfields) is moving really fast, many new techniques are becoming available from various different subfields. However, many of these algorithms are only made to run on very powerful research workstations without considering how they can be used on real-world hardware, be it embedded hardware, powerful […]

OpenCL

Apr, 20

Concurrent query processing in a GPU-based database system

The unrivaled computing capabilities of modern GPUs meet the demand of processing massive amounts of data seen in many application domains. While traditional HPC systems support applications as standalone entities that occupy entire GPUs, there are GPU-based DBMSs where multiple tasks are meant to be run at the same time in the same device. To […]

CUDA

Apr, 14

Accelerated Neural Networks on OpenCL Devices Using SYCL-DNN

Over the past few years machine learning has seen a renewed explosion of interest, following a number of studies showing the effectiveness of neural networks in a range of tasks which had previously been considered incredibly hard. Neural networks’ effectiveness in the fields of image recognition and natural language processing stems primarily from the vast […]

OpenCL

Apr, 14

OpenCL vs: Accelerated Finite-Difference Digital Synthesis

Digital audio synthesis has become an important component of modern music production with techniques that can produce realistic simulations of real instruments. Physical modelling sound synthesis is a category of audio synthesis that uses mathematical models to emulate the physical phenomena of acoustic musical instruments including drum membranes, air columns and strings. The synthesis of […]

OpenCL

•

OpenGL

Apr, 14

Distributed Deep Learning Strategies For Automatic Speech Recognition

In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art Long short-term memory (LSTM) acoustic model on the 2000-hour Switchboard (SWB2000), which is one of the most widely used datasets for ASR performance benchmark. We first investigate what are the […]

CUDA

Apr, 14

Cross-Platform Performance Portability Using Highly Parametrized SYCL Kernels

Over recent years heterogeneous systems have become more prevalent across HPC systems, with over 100 supercomputers in the TOP500 incorporating GPUs or other accelerators. These hardware platforms have different performance characteristics and optimization requirements. In order to make the most of multiple accelerators a developer has to provide implementations of their algorithms tuned for each […]

OpenCL

Apr, 14

On the Representation of Partially Specified Implementations and its Application to the Optimization of Linear Algebra Kernels on GPU

Traditional optimizing compilers rely on rewrite rules to iteratively apply program transformations. This iterative approach hides optimization opportunities behind intermediate transformation steps. For instance, vectorization can only be applied to the innermost loop in a nest: one must first perform a loop interchange before even considering vectorization of an outer loop. In contrast, we propose […]

CUDA

Apr, 7

High Performance Monte Carlo Simulation of Ising Model on TPU Clusters

Large scale deep neural networks profited from an emerging class of AI accelerators. Although the accelerators are specialized for machine learning, some of their designs are general enough for other computing intensive applications. Cloud TPU, as one of them, offers tremendous computing resources and is easily accessible through TensorFlow by expressing the computation in a […]

CUDA

Apr, 7

The Study of the OpenCL Processing Models for the FPGA Devices

In our study, we present the results of the implementation of the SHA-512 algorithm in FPGAs. The distinguished element of our work is that we conducted the work using OpenCL for FPGA, which is a relatively new development method for reconfigurable logic. We examine loop unrolling as an OpenCL performance optimization method and compare the […]

OpenCL

Apr, 7

Full-System Simulation of Mobile CPU/GPU Platforms

Graphics Processing Units (GPUs) critically rely on a complex system software stack comprising kernel- and userspace drivers and Just-in-time (JIT) compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is […]

OpenCL

Apr, 7

TonY: An Orchestrator for Distributed Machine Learning Jobs

Training machine learning (ML) models on large datasets requires considerable computing power. To speed up training, it is typical to distribute training across several machines, often with specialized hardware like GPUs or TPUs. Managing a distributed training job is complex and requires dealing with resource contention, distributed configurations, monitoring, and fault tolerance. In this paper, […]

Apr, 7

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs. A demo video […]

CUDA

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Real world applications of Artificial Intelligence on constrained hardware

Concurrent query processing in a GPU-based database system

Accelerated Neural Networks on OpenCL Devices Using SYCL-DNN

OpenCL vs: Accelerated Finite-Difference Digital Synthesis

Distributed Deep Learning Strategies For Automatic Speech Recognition

Cross-Platform Performance Portability Using Highly Parametrized SYCL Kernels

On the Representation of Partially Specified Implementations and its Application to the Optimization of Linear Algebra Kernels on GPU

High Performance Monte Carlo Simulation of Ising Model on TPU Clusters

The Study of the OpenCL Processing Models for the FPGA Devices

Full-System Simulation of Mobile CPU/GPU Platforms

TonY: An Orchestrator for Distributed Machine Learning Jobs

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Most viewed papers (last 30 days)