high performance computing on graphics processing units: hgpu.org

Posts

Jun, 14

OpenCL-Based Erasure Coding on Heterogeneous Architectures

Erasure coding, Reed-Solomon coding in particular, is a key technique to deal with failures in scale-out storage systems. However, due to the algorithmic complexity, the performance overhead of erasure coding can become a significant bottleneck in storage systems attempting to meet service level agreements (SLAs). Previous work has mainly leveraged SIMD (single-instruction multiple-data) instruction extensions […]

OpenCL

Jun, 14

Processing Big Data in Main Memory and on GPU

Many large-scale systems were designed with the assumption that I/O is the bottleneck, but this assumption has been challenged in the past decade with new trends in hardware capabilities and workload demands. The computational power of CPU cores has not improved proportional to the performance of disks and network interfaces in the past decade, but […]

CUDA • OpenCL

Jun, 14

Multi-GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL

Modern Graphics Processing Units (GPUs) are very useful for computing complex and time-consuming processes. GPUs provide high-performance computation capabilities at a good price. This paper deals with multi-GPU OpenCL and CUDA implementations of the k-Nearest Neighbor (k-NN) algorithm. This work compares the performance of the OpenCL and CUDA implementations, where each of them is suitable for […]

CUDA • OpenCL

Jun, 9

Analysis and Parameter Prediction of Compiler Transformation for Graphics Processors

In the last decade graphics processors (GPUs) have been extensively used to solve computationally intensive problems. A variety of GPU architectures by different hardware manufacturers have been shipped in a few years. OpenCL has been introduced as the standard cross-vendor programming framework for GPU computing. Writing and optimising OpenCL applications is a challenging task, the […]

OpenCL

Jun, 9

Adaptive Multi-level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU

Sparse matrix vector multiplication (SpMV) is the dominant kernel in scientific simulations. Many-core processors such as GPUs accelerate SpMV computations with high parallelism and memory bandwidth compared to CPUs; however, even for many-core processors the performance of SpMV is still strongly limited by memory bandwidth and lower locality of memory access to input vector causes […]

CUDA

Jun, 9

Decoupled Vector-Fetch Architecture with a Scalarizing Compiler

As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Among the prominent accelerators that have recently become more popular are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs). Surveying a wide […]

CUDA • OpenCL

Jun, 9

OpenMP Parallelization and Optimization of Graph-based Machine Learning Algorithms

We investigate the OpenMP parallelization and optimization of two novel data classification algorithms. The new algorithms are based on graph and PDE solution techniques and provide significant accuracy and performance advantages over traditional data classification algorithms in serial mode. The methods leverage the Nystrom extension to calculate eigenvalue/eigenvectors of the graph Laplacian and this is […]

Jun, 9

Runtime Specialization for Heterogeneous CPU-GPU Platforms

Heterogeneous parallel architectures like those comprised of CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, there remain several open challenges: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drive the need to dynamically match […]

CUDA • OpenCL

Jun, 7

Massively-Parallel Lossless Data Decompression

Today’s exponentially increasing data volumes and the high cost of storage make compression essential for the Big Data industry. Although research has concentrated on efficient compression, fast decompression is critical for analytics queries that repeatedly read compressed data. While decompression can be parallelized somewhat by assigning each data block to a different process, break-through speed-ups […]

CUDA

Jun, 7

Boda-RTC: Productive Generation of Portable, Efficient Code for Convolutional Neural Networks on Mobile Computing Platforms

The popularity of neural networks (NNs) spans academia, industry, and popular culture. In particular, convolutional neural networks (CNNs) have been applied to many image based machine learning tasks and have yielded strong results. The availability of hardware/software systems for efficient training and deployment of large and/or deep CNN models has been, and continues to be, […]

OpenCL

Jun, 7

Bit-Vectorized GPU Implementation of a Stochastic Cellular Automaton Model for Surface Growth

Stochastic surface growth models aid in studying properties of universality classes like the Kardar–Parisi–Zhang class. High-precision results obtained from large-scale computational studies can be transferred to many physical systems. Many properties, such as roughening and some two-time functions, can be studied using stochastic cellular automaton (SCA) variants of stochastic models. Here we present […]

CUDA

Jun, 7

Co-tuning of Software Specializers and Hardware Accelerators within a CNN Application

Software specializers and hardware accelerators share the common goal of decreasing the runtime of an operation while being parameterizable and abstracting away underlying optimizations from users. The competition for reconfigurable hardware resources among candidate hardware accelerators means that tuning must take place at an application level and not at an operation level as is the […]

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?
