18854

Posts

Apr, 20

Loop Perforation in OpenACC

High-level programming models such as OpenMP and OpenACC are used to accelerate loop-parallelizable applications. In such applications, a very large number of loop iterations are launched as threads on the accelerator, where every iteration executes the same code sequence (loop body or kernel) but on different data. In such workloads, similarities in the input lead […]
Apr, 20

A Comparative Study of Asynchronous Many-Tasking Runtimes: Cilk, Charm++, ParalleX and AM++

We evaluate and compare four contemporary and emerging runtimes for high-performance computing(HPC) applications: Cilk, Charm++, ParalleX and AM++. We compare along three bases: programming model, execution model and the implementation on an underlying machine model. The comparison study includes a survey of each runtime system’s programming models, their corresponding execution models, their stated features, and […]
Apr, 20

On Optimizing Complex Stencils on GPUs

Stencil computations are often the computeintensive kernel in many scientific applications. With the increasing demand for computational accuracy, and the emergence of massively data-parallel high-bandwidth architectures like GPUs, stencils have steadily become more complex in terms of the stencil order, data accesses, and reuse patterns. Many prior efforts have focused on optimizing simpler stencil computations […]
Apr, 20

Real world applications of Artificial Intelligence on constrained hardware

These days the field of Artificial Intelligence (and its many subfields) is moving really fast, many new techniques are becoming available from various different subfields. However, many of these algorithms are only made to run on very powerful research workstations without considering how they can be used on real-world hardware, be it embedded hardware, powerful […]
Apr, 20

Concurrent query processing in a GPU-based database system

The unrivaled computing capabilities of modern GPUs meet the demand of processing massive amounts of data seen in many application domains. While traditional HPC systems support applications as standalone entities that occupy entire GPUs, there are GPU-based DBMSs where multiple tasks are meant to be run at the same time in the same device. To […]
Apr, 14

Accelerated Neural Networks on OpenCL Devices Using SYCL-DNN

Over the past few years machine learning has seen a renewed explosion of interest, following a number of studies showing the effectiveness of neural networks in a range of tasks which had previously been considered incredibly hard. Neural networks’ effectiveness in the fields of image recognition and natural language processing stems primarily from the vast […]
Apr, 14

OpenCL vs: Accelerated Finite-Difference Digital Synthesis

Digital audio synthesis has become an important component of modern music production with techniques that can produce realistic simulations of real instruments. Physical modelling sound synthesis is a category of audio synthesis that uses mathematical models to emulate the physical phenomena of acoustic musical instruments including drum membranes, air columns and strings. The synthesis of […]
Apr, 14

Distributed Deep Learning Strategies For Automatic Speech Recognition

In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art Long short-term memory (LSTM) acoustic model on the 2000-hour Switchboard (SWB2000), which is one of the most widely used datasets for ASR performance benchmark. We first investigate what are the […]
Apr, 14

On the Representation of Partially Specified Implementations and its Application to the Optimization of Linear Algebra Kernels on GPU

Traditional optimizing compilers rely on rewrite rules to iteratively apply program transformations. This iterative approach hides optimization opportunities behind intermediate transformation steps. For instance, vectorization can only be applied to the innermost loop in a nest: one must first perform a loop interchange before even considering vectorization of an outer loop. In contrast, we propose […]
Apr, 14

Cross-Platform Performance Portability Using Highly Parametrized SYCL Kernels

Over recent years heterogeneous systems have become more prevalent across HPC systems, with over 100 supercomputers in the TOP500 incorporating GPUs or other accelerators. These hardware platforms have different performance characteristics and optimization requirements. In order to make the most of multiple accelerators a developer has to provide implementations of their algorithms tuned for each […]
Apr, 7

High Performance Monte Carlo Simulation of Ising Model on TPU Clusters

Large scale deep neural networks profited from an emerging class of AI accelerators. Although the accelerators are specialized for machine learning, some of their designs are general enough for other computing intensive applications. Cloud TPU, as one of them, offers tremendous computing resources and is easily accessible through TensorFlow by expressing the computation in a […]
Apr, 7

The Study of the OpenCL Processing Models for the FPGA Devices

In our study, we present the results of the implementation of the SHA-512 algorithm in FPGAs. The distinguished element of our work is that we conducted the work using OpenCL for FPGA, which is a relatively new development method for reconfigurable logic. We examine loop unrolling as an OpenCL performance optimization method and compare the […]

* * *

* * *

HGPU group © 2010-2019 hgpu.org

All rights belong to the respective authors

Contact us: