high performance computing on graphics processing units: hgpu.org

Posts

Sep, 27

Solving Batched Linear Programs on GPU and Multicore CPU

Linear Programs (LPs) appear in a large number of applications, and offloading them to the GPU is a viable way to gain performance. Existing work on offloading and solving an LP on GPU suggests that performance is gained from large LPs (typically 500 constraints, 500 variables and above). In order to gain performance from GPU for […]

CUDA

Sep, 27

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered […]

CUDA

Sep, 22

Bridging the Semantic Gaps of GPU Acceleration for Scaleout CNN-based Big Data Processing: Think Big, See Small

Convolutional Neural Networks (CNNs) have substantially advanced the state-of-the-art accuracies of object recognition, which is the core function of a myriad of modern multimedia processing techniques such as image/video processing, speech recognition, and natural language processing. GPU-based accelerators gained increasing attention because a large number of highly parallel neurons in CNN naturally matches the GPU […]

CUDA

Sep, 22

Tuning Stencil Codes in OpenCL for FPGAs

OpenCL is designed as a parallel programming framework to support heterogeneous computing platforms. The implicit or explicit parallelism in OpenCL kernel code enables efficient FPGA implementation from a high-level programming abstraction. However, FPGA architecture is completely different from GPU architecture, for which OpenCL is widely used. Tuning OpenCL codes to achieve high performance on FPGAs […]

OpenCL

Sep, 22

Characterization of Speech Recognition Systems on GPU Architectures

Automatic speech recognition is one of the most important applications in the area of cognitive computing. Mobile devices, such as smartphones, have incorporated speech recognition as one of the main interfaces for user interaction. This trend towards voice-based user interfaces is likely to continue in the next years. Effective speech recognition systems require real-time recognition, […]

CUDA

Sep, 22

Efficient dictionary learning implementation on the GPU using OpenCL

The dictionary learning field offers a wide range of algorithms that are able to provide good sparse approximations and well trained dictionaries. These algorithms are very complex and this is reflected in the slow execution of their computationally intensive implementations. This article proposes efficient parallel implementations for the main algorithms in the field that significantly […]

OpenCL

Sep, 22

MCS 572: Introduction to Supercomputing

The goal of the course is to study parallel algorithms and their implementation on distributed and shared memory computers, using message passing, OpenMP, and threads. In the second half of the course we will consider general purpose graphics processing units. Prerequisites are a working knowledge of C (or willingness to acquire programming skills) and a […]

CUDA

Sep, 20

Acceleration of Block-Aware Matrix Factorization on Heterogeneous Platforms

Block-structured matrices arise in several contexts in circuit simulation problems. These matrices typically inherit the pattern of sparsity from the circuit connectivity. However, they are also characterized by dense spots or blocks. Direct factorization of those matrices has emerged as an attractive approach if the host memory is sufficiently large to store the block-structured matrix. […]

OpenCL

Sep, 20

Parallel Computational Fluid Dynamics With the Intel Xeon Phi Coprocessor

The Intel Xeon Phi coprocessor is a PCI Express form factor card designed to work in tandem with Intel Xeon processors in order to allow faster execution of highly parallelizable code. Efficient execution of highly parallel applications is achieved through the use of many smaller, lower clock speed cores, allowing for many more simultaneous execution […]

Sep, 20

A Compiler for Throughput Optimization of Graph Algorithms on GPUs

Writing high-performance GPU implementations of graph algorithms can be challenging. In this paper, we argue that three optimizations called throughput optimizations are key to high performance for this application class. These optimizations describe a large implementation space, making it unrealistic for programmers to implement them by hand. To address this problem, we have implemented these optimizations […]

CUDA

Sep, 20

Feynman Machine: The Universal Dynamical Systems Computer

Efforts at understanding the computational processes in the brain have met with limited success, despite their importance and potential uses in building intelligent machines. We propose a simple new model which draws on recent findings in Neuroscience and the Applied Mathematics of interacting Dynamical Systems. The Feynman Machine is a Universal Computer for Dynamical Systems, […]

OpenCL

Sep, 20

Runtime Support for Adaptive Power Capping on Heterogeneous SoCs

Power capping is a fundamental method for reducing the energy consumption of a wide range of modern computing environments, ranging from mobile embedded systems to datacentres. Unfortunately, maximising performance and system efficiency under static power caps remains challenging, while maximising performance under dynamic power caps has been largely unexplored. We present an adaptive power capping […]

OpenCL

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)