Posts

Apr, 14

OpenCL vs: Accelerated Finite-Difference Digital Synthesis

Digital audio synthesis has become an important component of modern music production with techniques that can produce realistic simulations of real instruments. Physical modelling sound synthesis is a category of audio synthesis that uses mathematical models to emulate the physical phenomena of acoustic musical instruments including drum membranes, air columns and strings. The synthesis of […]
Apr, 14

Distributed Deep Learning Strategies For Automatic Speech Recognition

In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art long short-term memory (LSTM) acoustic model on the 2000-hour Switchboard (SWB2000), which is one of the most widely used datasets for ASR performance benchmarking. We first investigate what are the […]
Apr, 14

Cross-Platform Performance Portability Using Highly Parametrized SYCL Kernels

Over recent years heterogeneous systems have become more prevalent across HPC systems, with over 100 supercomputers in the TOP500 incorporating GPUs or other accelerators. These hardware platforms have different performance characteristics and optimization requirements. In order to make the most of multiple accelerators a developer has to provide implementations of their algorithms tuned for each […]
Apr, 14

On the Representation of Partially Specified Implementations and its Application to the Optimization of Linear Algebra Kernels on GPU

Traditional optimizing compilers rely on rewrite rules to iteratively apply program transformations. This iterative approach hides optimization opportunities behind intermediate transformation steps. For instance, vectorization can only be applied to the innermost loop in a nest: one must first perform a loop interchange before even considering vectorization of an outer loop. In contrast, we propose […]
Apr, 7

High Performance Monte Carlo Simulation of Ising Model on TPU Clusters

Large scale deep neural networks profited from an emerging class of AI accelerators. Although the accelerators are specialized for machine learning, some of their designs are general enough for other computing intensive applications. Cloud TPU, as one of them, offers tremendous computing resources and is easily accessible through TensorFlow by expressing the computation in a […]
Apr, 7

The Study of the OpenCL Processing Models for the FPGA Devices

In our study, we present the results of the implementation of the SHA-512 algorithm in FPGAs. The distinguished element of our work is that we conducted the work using OpenCL for FPGA, which is a relatively new development method for reconfigurable logic. We examine loop unrolling as an OpenCL performance optimization method and compare the […]
Apr, 7

Full-System Simulation of Mobile CPU/GPU Platforms

Graphics Processing Units (GPUs) critically rely on a complex system software stack comprising kernel- and userspace drivers and Just-in-time (JIT) compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is […]
Apr, 7

TonY: An Orchestrator for Distributed Machine Learning Jobs

Training machine learning (ML) models on large datasets requires considerable computing power. To speed up training, it is typical to distribute training across several machines, often with specialized hardware like GPUs or TPUs. Managing a distributed training job is complex and requires dealing with resource contention, distributed configurations, monitoring, and fault tolerance. In this paper, […]
Apr, 7

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs. A demo video […]
Mar, 31

Methods for Accelerating Machine Learning in High Performance Computing

Driven by massive dataset corpuses and advances and programmability in accelerator architectures, such as GPUs and FPGAs, machine learning (ML) has delivered remarkable, human-like accuracy in tasks such as image recognition, machine translation and speech processing. Although ML has improved accuracy in selected human tasks, the time to train models can range from hours to […]
Mar, 31

Dynamic Application Autotuning for Self-Aware Approximate Computing

In the autonomic computing context, we perceive the system as an ensemble of autonomous elements capable of self-managing, where end users define high-level goals and the system shall adapt to achieve the desired behaviour. This runtime adaptation creates several optimisation opportunities, especially if we consider approximate computing applications, where it is possible to trade off the […]
Mar, 31

Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey

The combined impact of new computing resources and techniques with an increasing avalanche of large datasets is transforming many research areas and may lead to technological breakthroughs that can be used by billions of people. In recent years, Machine Learning and especially its subfield Deep Learning have seen impressive advances. Techniques developed within these […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
