
Posts

Jun, 21

On the Use of a GPU-Accelerated Mobile Device Processor for Sound Source Localization

The growing interest in incorporating new features into mobile devices has increased the number of signal processing applications running on processors designed for mobile computing. A challenging signal processing field is acoustic source localization, which is attractive for applications such as automatic camera steering systems, human-machine interfaces, video gaming or audio surveillance. In this context, […]
Jun, 21

Rgtsvm: Support Vector Machines on a GPU in R

Rgtsvm provides a fast and flexible support vector machine (SVM) implementation for the R language. The distinguishing feature of Rgtsvm is that support vector classification and support vector regression tasks are implemented on a graphical processing unit (GPU), allowing the libraries to scale to millions of examples with >100-fold improvement in performance over existing implementations. […]
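
Rgtsvm deliberately keeps the conventional SVM interface while moving training to the GPU. As a rough, CPU-only illustration of the two tasks it accelerates (support vector classification and regression), here is a minimal scikit-learn sketch in Python; Rgtsvm's own interface is R, and the data below is synthetic:

    # Minimal CPU sketch of the two tasks Rgtsvm accelerates on a GPU:
    # support vector classification (SVC) and regression (SVR).
    # scikit-learn is a stand-in here; Rgtsvm itself is an R package.
    import numpy as np
    from sklearn.svm import SVC, SVR

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))                # 1000 examples, 20 features
    y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)    # binary labels
    y_reg = X @ rng.normal(size=20)                # regression targets

    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y_cls)
    reg = SVR(kernel="rbf", C=1.0).fit(X, y_reg)
    print(clf.score(X, y_cls), reg.score(X, y_reg))
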
Jun, 21

Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras

We introduce Kapre, Keras layers for audio and music signal preprocessing. Music research using deep neural networks requires a heavy and tedious preprocessing stage, for which audio processing parameters are often ignored in parameter optimisation. To solve this problem, Kapre implements time-frequency conversions, normalisation, and data augmentation as Keras layers. We report simple benchmark results, […]
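
Kapre's core idea is to make preprocessing a differentiable part of the model. The sketch below illustrates that idea with a hand-rolled Keras layer built on tf.signal; it is not Kapre's actual API, just a minimal stand-in showing a time-frequency conversion running on the GPU inside the network:

    # Sketch of the idea behind Kapre: compute a log-mel spectrogram inside
    # the model as a Keras layer, so preprocessing runs on the GPU and its
    # parameters live next to the network. Uses tf.signal, not Kapre's API.
    import tensorflow as tf

    class MelSpectrogram(tf.keras.layers.Layer):
        def __init__(self, sr=16000, n_fft=512, hop=256, n_mels=64, **kw):
            super().__init__(**kw)
            self.sr, self.n_fft, self.hop, self.n_mels = sr, n_fft, hop, n_mels

        def call(self, waveforms):                  # (batch, samples)
            stft = tf.signal.stft(waveforms, frame_length=self.n_fft,
                                  frame_step=self.hop)
            power = tf.abs(stft) ** 2               # (batch, frames, bins)
            mel_w = tf.signal.linear_to_mel_weight_matrix(
                num_mel_bins=self.n_mels,
                num_spectrogram_bins=self.n_fft // 2 + 1,
                sample_rate=self.sr)
            mel = tf.matmul(power, mel_w)
            return tf.math.log(mel + 1e-6)          # log-mel, ready for a CNN

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(16000,)),      # 1 s of 16 kHz audio
        MelSpectrogram(),
        tf.keras.layers.Reshape((-1, 64, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
    ])
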
Jun, 17

Efficient OpenCL-based concurrent tasks offloading on accelerators

Current heterogeneous platforms with CPUs and accelerators have the ability to launch several independent tasks simultaneously, in order to exploit concurrency among them. These tasks typically consist of data transfer commands and kernel computation commands. In this paper we develop a runtime approach to optimize the concurrency between data transfers and kernel computation commands in […]
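
The concurrency in question can be made concrete with two OpenCL command queues, so the transfer commands of one task are free to overlap the kernel of the other. A minimal pyopencl sketch (the kernel and sizes are illustrative):

    # Illustrative pyopencl sketch of the concurrency being optimized:
    # two independent tasks, each a host->device copy plus a kernel, issued
    # on separate command queues so transfers and compute can overlap.
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    q0, q1 = cl.CommandQueue(ctx), cl.CommandQueue(ctx)
    prg = cl.Program(ctx, """
    __kernel void scale(__global float *x, const float a) {
        int i = get_global_id(0);
        x[i] *= a;
    }""").build()

    n = 1 << 20
    a_host = np.ones(n, np.float32)
    b_host = np.ones(n, np.float32)
    mf = cl.mem_flags
    a_dev = cl.Buffer(ctx, mf.READ_WRITE, a_host.nbytes)
    b_dev = cl.Buffer(ctx, mf.READ_WRITE, b_host.nbytes)

    # Task 0 on queue q0, task 1 on queue q1: the runtime may overlap
    # task 1's transfer with task 0's kernel (hardware permitting).
    cl.enqueue_copy(q0, a_dev, a_host)
    prg.scale(q0, (n,), None, a_dev, np.float32(2.0))
    cl.enqueue_copy(q1, b_dev, b_host)
    prg.scale(q1, (n,), None, b_dev, np.float32(3.0))
    cl.enqueue_copy(q0, a_host, a_dev)
    cl.enqueue_copy(q1, b_host, b_dev)
    q0.finish(); q1.finish()
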
Jun, 17

Device Placement Optimization with Reinforcement Learning

The past few years have witnessed a growth in size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the neural […]
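
The search space of the placement policy is the assignment of model parts to devices, which TensorFlow exposes manually through tf.device. A minimal sketch of the mechanism the reinforcement-learning controller automates (device strings and op choices are illustrative):

    # The decision space the RL policy searches: which device runs which
    # part of the model. TensorFlow exposes this manually via tf.device;
    # the paper learns such assignments instead of hand-crafting them.
    import tensorflow as tf

    with tf.device("/CPU:0"):          # e.g. keep a lookup-heavy op on CPU
        x = tf.random.normal((64, 1024))

    # Fall back to CPU transparently if no GPU is present.
    device = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"
    with tf.device(device):            # place the heavy matmul on an accelerator
        w = tf.random.normal((1024, 1024))
        y = tf.matmul(x, w)
    print(y.device)
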
Jun, 17

Non-Hydrostatic Pressure Shallow Flows: GPU Implementation Using Finite-Volume and Finite-Difference Scheme

We consider the depth-integrated non-hydrostatic system derived by Yamazaki et al. An efficient formally second-order well-balanced hybrid finite-volume/finite-difference numerical scheme is proposed. The scheme consists of a two-step algorithm. First, the hyperbolic part of the system is discretized using a PVM path-conservative finite-volume method. Second, the dispersive terms are solved by means of compact […]
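
As a schematic toy of that two-step structure (emphatically not the paper's PVM/compact scheme), the sketch below splits a 1D model equation into a finite-volume update of the hyperbolic part followed by a finite-difference treatment of a dispersive term; the equation and all coefficients are illustrative:

    # Schematic toy of the two-step splitting: (1) hyperbolic part
    # u_t + c u_x = 0 via a first-order upwind finite-volume flux update,
    # (2) dispersive part u_t = -eps u_xxx via centered finite differences.
    import numpy as np

    nx, dx, dt, c, eps = 200, 0.05, 0.01, 1.0, 1e-3
    u = np.exp(-((np.arange(nx) * dx - 3.0) ** 2))    # initial bump, periodic

    def step(u):
        # 1) finite-volume update of the hyperbolic part
        flux = c * u                                   # upwind flux (c > 0)
        u = u - dt / dx * (flux - np.roll(flux, 1))
        # 2) finite-difference update of the dispersive part
        u_xxx = (np.roll(u, -2) - 2 * np.roll(u, -1)
                 + 2 * np.roll(u, 1) - np.roll(u, 2)) / (2 * dx ** 3)
        return u - dt * eps * u_xxx

    for _ in range(100):
        u = step(u)
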
Jun, 17

Parallel Monte Carlo on Intel MIC Architecture

The trade-off between the cost-efficiency of powerful computational accelerators and the increasing energy needed to perform numerical tasks can be tackled by implementing algorithms on the Intel Many Integrated Core (MIC) architecture. The best performance of the algorithms requires the use of appropriate optimization and parallelization approaches throughout all stages of their design. Monte Carlo […]
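
A minimal Monte Carlo kernel of the kind ported to MIC is sketched below: on Xeon Phi the loop would be spread over threads with OpenMP and vectorized, while here numpy vectorization and independent seeds stand in for that parallelism:

    # Minimal Monte Carlo kernel (pi estimation). Independent random
    # streams make the work trivially parallel: one seed per worker.
    import numpy as np

    def mc_pi(n, seed=0):
        rng = np.random.default_rng(seed)
        x, y = rng.random(n), rng.random(n)
        return 4.0 * np.mean(x * x + y * y <= 1.0)

    estimates = [mc_pi(1_000_000, seed=s) for s in range(4)]
    print(sum(estimates) / len(estimates))
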
Jun, 17

Parallel Computing of Particle Trajectory Sonification to Enable Real-Time Interactivity

In this paper, we revisit, explore and extend the Particle Trajectory Sonification (PTS) model, which supports cluster analysis of high-dimensional data by probing a model space with virtual particles that are "gravitationally" attracted to a mode of the dataset's potential function. The particles' kinetic energy progression as a function of time adds directly to a […]
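
A toy numpy sketch of that dynamic, under assumed Gaussian-kernel forces and illustrative constants: particles accelerate toward dense regions of the dataset, and the kinetic-energy trace over time is the quantity that would be sonified:

    # Toy sketch of the PTS idea: particles fall toward modes of a
    # Gaussian-kernel potential built from the dataset; the kinetic-energy
    # trace over time is the signal that gets mapped to sound.
    import numpy as np

    rng = np.random.default_rng(1)
    data = rng.normal(size=(50, 2))          # dataset defining the potential
    pos = rng.normal(size=(10, 2))           # particle positions
    vel = np.zeros_like(pos)
    sigma, dt, damping = 0.5, 0.01, 0.995

    kinetic = []
    for _ in range(2000):
        diff = data[None, :, :] - pos[:, None, :]              # (P, N, 2)
        w = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))
        force = np.sum(w[:, :, None] * diff, axis=1) / sigma ** 2
        vel = damping * (vel + dt * force)                     # light friction
        pos = pos + dt * vel
        kinetic.append(0.5 * np.sum(vel ** 2))                 # per-step energy
    # `kinetic` would be mapped to audio (e.g. an amplitude envelope).
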
Jun, 10

Smith-Waterman Acceleration in Multi-GPUs: A Performance per Watt Analysis

We present a performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms using the exact Smith-Waterman method. Speed-up factors and energy consumption are monitored on different stages of the algorithm with the goal of identifying advantageous scenarios to maximize acceleration […]
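
For reference, the exact method being accelerated is the classic quadratic-time dynamic program below (a plain CPU version keeping only two matrix rows; CUDAlign's contribution is parallelizing this computation across GPUs for huge sequences):

    # Reference Smith-Waterman local alignment score (CPU, quadratic time,
    # linear memory for the score). Scoring constants are illustrative.
    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        prev = [0] * (len(b) + 1)
        best = 0
        for i in range(1, len(a) + 1):
            curr = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                curr[j] = max(0,                  # local alignment floor
                              prev[j - 1] + s,    # diagonal: (mis)match
                              prev[j] + gap,      # up: gap in b
                              curr[j - 1] + gap)  # left: gap in a
                best = max(best, curr[j])
            prev = curr
        return best

    print(smith_waterman("GATTACA", "GCATGCU"))
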
Jun, 10

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker […]
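
The visible part of the paper's recipe is two rules: scale the learning rate linearly with the minibatch size, and ramp it up gradually at the start of training. A sketch at epoch granularity (the paper applies the warmup per iteration):

    # The paper's two rules: linear learning-rate scaling plus gradual
    # warmup. Simplified to epoch granularity for illustration.
    def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256,
                      warmup_epochs=5):
        target = base_lr * batch_size / base_batch   # linear scaling rule
        if epoch < warmup_epochs:                    # gradual warmup
            return target * (epoch + 1) / warmup_epochs
        return target

    # e.g. 8192-image minibatches (256 GPUs x 32 images each):
    for e in range(7):
        print(e, learning_rate(e, batch_size=8192))
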
Jun, 10

Crane – Fast and Migratable GPU Passthrough for OpenCL applications

General purpose GPU (GPGPU) computing in virtualized environments leverages PCI passthrough to achieve GPU performance comparable to bare-metal execution. However, GPU passthrough prevents service administrators from performing virtual machine migration between physical hosts. Crane is a new technique for virtualizing OpenCL-based GPGPU computing that achieves within 5.25% of passthrough GPU performance while supporting VM migration. […]
Jun, 10

MobiRNN: Efficient Recurrent Neural Network Execution on Mobile GPU

In this paper, we explore optimizations to run Recurrent Neural Network (RNN) models locally on mobile devices. RNN models are widely used for Natural Language Processing, Machine Translation, and other tasks. However, existing mobile applications that use RNN models do so on the cloud. To address privacy and efficiency concerns, we show how RNN models […]
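
The models in question are compact recurrent networks. Below is a minimal Keras LSTM of the kind that would be dispatched to a mobile GPU (layer sizes are illustrative; MobiRNN itself targets Android GPUs via RenderScript):

    # The kind of compact RNN MobiRNN targets for on-device execution:
    # a small LSTM classifier. Sizes here are illustrative.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(50, 32)),   # 50 timesteps, 32 features
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
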

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org