high performance computing on graphics processing units: hgpu.org

Posts

Mar, 29

A generalized GPU-based connected component labeling algorithm

We propose a generalized GPU-based connected component labeling (CCL) algorithm that can be applied uniformly both to various lattices and to non-lattice environments. We extend our recent GPU-based CCL algorithm, which avoids conventional iteration, to this generalized method. As an application of this algorithm, we deal with the bond percolation […]

CUDA

Mar, 29

Generic Inverted Index on the GPU

Data variety, one of the three Vs of Big Data, is manifested by a growing number of complex data types such as documents, sequences, trees, graphs and high-dimensional vectors. To perform similarity search on these data, existing works mainly choose to create customized indexes for different data types. Due to the diversity […]

CUDA

Mar, 29

A Stencil DSEL for Single Code Accelerated Computing with SYCL

Stencil kernels arise in many scientific codes as the result of discretizing natural, continuous phenomena. Many research works have designed stencil frameworks to help programmers optimize stencil kernels for performance, and to target CPUs or accelerators. However, existing stencil frameworks, whether library-based or language-based, require writing distinct source codes for accelerated kernels and for […]

OpenCL

Mar, 29

GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model

The realized stochastic volatility (RSV) model, which utilizes the realized volatility as additional information, has been proposed to infer the volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The […]

CUDA

Mar, 25

Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets

An important class of problems involves training deep neural networks with sparse prediction targets of very high dimension D. These occur naturally in e.g. neural language models or the learning of word-embeddings, often posed as predicting the probability of next words among a vocabulary of size D (e.g. 200,000). Computing the equally large, but typically […]

CUDA

Mar, 25

Wanted: Floating-Point Add Round-off Error instruction

We propose a new instruction (FPADDRE) that computes the round-off error in floating-point addition. We explain how this instruction benefits high-precision arithmetic operations in applications where double precision is not sufficient. Performance estimates on Intel Haswell, Intel Skylake, and AMD Steamroller processors, as well as the Intel Knights Corner co-processor, demonstrate that such an instruction would […]
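For readers unfamiliar with the quantity in question, the round-off error of a floating-point addition can already be recovered in software, at the cost of several dependent operations. The sketch below uses Knuth's well-known TwoSum sequence; it is an illustration of the idea only, not code from the paper, and the function name `two_sum` is a hypothetical choice.

```python
def two_sum(a: float, b: float) -> tuple[float, float]:
    """Return (s, e) where s is the rounded sum a + b and e is its round-off error.

    Knuth's TwoSum: barring overflow, s + e equals a + b exactly.
    This is a software illustration, not the proposed FPADDRE instruction.
    """
    s = a + b                 # rounded sum
    b_virtual = s - a         # the part of b that actually made it into s
    a_virtual = s - b_virtual # the part of a that actually made it into s
    b_roundoff = b - b_virtual
    a_roundoff = a - a_virtual
    return s, a_roundoff + b_roundoff


if __name__ == "__main__":
    # The small addend is lost in the rounded sum but recovered in the error term.
    print(two_sum(1.0, 1e-16))
```

The software version needs six floating-point operations per addition, which is the kind of overhead a dedicated instruction would presumably aim to remove.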

CUDA • OpenCL

Mar, 25

Accelerating Deep Neural Network Training with Inconsistent Stochastic Gradient Descent

SGD is the widely adopted method for training CNNs. Conceptually, it approximates the population with a randomly sampled batch; it then trains batches evenly by conducting a gradient update on every batch in an epoch. In this paper, we demonstrate that Sampling Bias, Intrinsic Image Difference and Fixed Cycle Pseudo Random Sampling differentiate batches in training, […]

CUDA

Mar, 25

An Efficient Implementation of the Longest Common Subsequence Algorithm with Bit-Parallelism on GPUs

The longest common subsequence (LCS) for two given strings has various applications, such as for the comparison of deoxyribonucleic acid (DNA). In this thesis, we propose a graphics processing unit (GPU) algorithm to accelerate Hirschberg’s LCS algorithm improved with the bit-parallel algorithm by Crochemore et al. The algorithm by Crochemore et al. includes bitwise logical […]
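As a rough illustration of what bit-parallelism means here, the sketch below computes the LCS length on the CPU using a commonly cited form of the bit-vector recurrence attributed to Crochemore et al. It is an assumption that the thesis uses this exact formulation, since the abstract is truncated; the function name is hypothetical, and neither the Hirschberg recursion (which recovers the subsequence itself) nor any GPU code is shown.

```python
def lcs_length_bitparallel(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (CPU sketch).

    Python integers serve as arbitrarily wide bit vectors, so one loop
    iteration processes a whole column of the dynamic-programming table.
    """
    m = len(a)
    if m == 0:
        return 0
    mask = (1 << m) - 1  # keep only the low m bits after each step

    # match[ch] has bit i set exactly where a[i] == ch.
    match: dict[str, int] = {}
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)

    v = mask  # all ones: no matches recorded yet
    for ch in b:
        x = match.get(ch, 0)
        # Recurrence: V' = (V + (V & M)) | (V & ~M), truncated to m bits.
        v = ((v + (v & x)) | (v & ~x)) & mask

    # Each zero bit in v corresponds to one unit of LCS length.
    return m - bin(v).count("1")


if __name__ == "__main__":
    print(lcs_length_bitparallel("AGCAT", "GAC"))
```

Only additions and bitwise logical operations appear in the inner loop, which is what makes the recurrence attractive for wide registers.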

CUDA

Mar, 25

A mixed precision semi-Lagrangian algorithm and its performance on accelerators

In this paper we propose a mixed precision algorithm in the context of the semi-Lagrangian discontinuous Galerkin method. The performance of this approach is evaluated on a traditional dual socket workstation as well as on a Xeon Phi and an NVIDIA K80. We find that the mixed precision algorithm can be implemented efficiently on these […]

CUDA

Mar, 25

A Survey of Recent Prefetching Techniques for Processor Caches

As the trends of process scaling make the memory system an even more crucial bottleneck, the importance of latency hiding techniques such as prefetching grows further. However, naively using prefetching can harm performance and energy efficiency and hence, several factors and parameters need to be taken into account to fully realize its potential. In this paper, we […]

Mar, 22

The First International Workshop on GPU Computing and Applications (GCA), 2016

Built for massive parallelism, General-Purpose computing on Graphics Processing Units (GPGPU) has superseded high-performance CPUs in a number of important tasks, including computer graphics, physics calculations, encryption/decryption and scientific computations. The goal of this workshop is to provide a forum to discuss and evaluate emerging techniques, platforms and applications capable of harvesting the power […]

Mar, 22

Comparison of Technologies for General-Purpose Computing on Graphics Processing Units

The computational capacity of graphics cards for general-purpose computing has progressed fast over the last decade. A major reason is computationally heavy computer games, where the standard of performance and high-quality graphics constantly rises. Another reason is better-suited technologies for programming the graphics cards. Combined, the product is high raw performance devices and means […]

OpenCL • OpenGL
