high performance computing on graphics processing units: hgpu.org

Posts

Mar, 20

Interactive Illustrative Line Styles and Line Style Transfer Functions for Flow Visualization

We present a flexible illustrative line style model for the visualization of streamline data. Our model partitions view-oriented line strips into parallel bands whose basic visual properties can be controlled independently. We thus extend previous line stylization techniques specifically for visualization purposes by allowing the parametrization of these bands based on the local line data […]

Mar, 20

On learning optimized reaction diffusion processes for effective image restoration

For several decades, image restoration has remained an active research topic in low-level computer vision, and new approaches are constantly emerging. However, many recently proposed algorithms achieve state-of-the-art performance only at the expense of very high computation time, which clearly limits their practical relevance. In this work, we propose a simple but effective approach with […]

CUDA

Mar, 20

The More We Share, The More We Have: Improving GPU performance through Register Sharing

Graphics Processing Units (GPUs) consisting of Streaming Multiprocessors (SMs) achieve high throughput by running a large number of threads and context switching among them to hide execution latencies. The amount of thread level parallelism that can be utilized depends on the number of resident threads on each of the SMs. The threads are typically structured […]

CUDA

Mar, 20

Implementation of a Practical Distributed Calculation System with Browsers and JavaScript, and Application to Distributed Deep Learning

Deep learning can achieve outstanding results in various fields. However, it requires so much computational power that graphics processing units (GPUs) and/or numerous computers are often needed for practical application. We have developed a new distributed calculation framework called "Sashimi" that allows any computer to be used as a distribution node simply by accessing […]

OpenCL

Mar, 18

Fast Sparse Matrix Multiplication on GPU

Sparse matrix multiplication is an important algorithm in a wide variety of problems, including graph algorithms, simulations and linear solving, to name a few. Yet only a few works address the acceleration of sparse matrix multiplication on a GPU. We present a fast, novel algorithm for sparse matrix multiplication, outperforming the previous algorithm […]

CUDA • OpenCL

Mar, 18

Local vs. Global Optimization: Operator Placement Strategies in Heterogeneous Environments

In several parts of query optimization, like join enumeration or physical operator selection, there is always the question of how much optimization is needed and how large the performance benefits are. In particular, a decision for either global optimization (e.g., during query optimization) or local optimization (during query execution) has to be taken. In this […]

OpenCL

Mar, 18

Portable GPU-Based Artificial Neural Networks for Accelerated Data-Driven Modeling

Artificial neural networks (ANNs) are widely applied as data-driven modeling tools in hydroinformatics because of their broad applicability in handling implicit and nonlinear relationships between input and output data. To obtain a reliable ANN model, training on the data is essential, but training usually takes many hours for a large […]

CUDA • OpenCL

Mar, 18

Accelerating Direction-Optimized Breadth First Search on Hybrid Architectures

Large scale-free graphs are famously difficult to process efficiently: the highly skewed vertex degree distribution makes it difficult to obtain balanced workload partitions for parallel processing. Our research instead aims to take advantage of vertex degree heterogeneity by partitioning the workload to match the strength of the individual computing elements in a hybrid architecture. This […]

CUDA

Mar, 18

A Switched Dynamical System Framework for Analysis of Massively Parallel Asynchronous Numerical Algorithms

In the near future, massively parallel computing systems will be necessary to solve computation intensive applications. The key bottleneck in massively parallel implementation of numerical algorithms is the synchronization of data across processing elements (PEs) after each iteration, which results in significant idle time. Thus, there is a trend towards relaxing the synchronization and adopting […]

CUDA

Mar, 18

Fast Radix Sort for Sparse Linear Algebra on GPU

Fast sorting is an important step in many parallel algorithms that require data ranking, ordering or partitioning. Parallel sorting is a widely researched subject, and many algorithms have been developed in the past. In this paper, the focus is on implementing highly efficient sorting routines for sparse linear algebra operations, such as parallel sparse matrix […]

CUDA • OpenCL

Mar, 14

Heterogeneous Acceleration of Volumetric JPEG 2000

We present the implementation of a volumetric JPEG 2000 codec as a real-world use case of software acceleration with GPUs and multi-core CPUs. We present a generic methodology to accelerate existing code written in C with OpenCL. Furthermore, we account for the volumetric nature of the processed data and formulate associated optimization guidelines. The resulting […]

OpenCL

Mar, 14

EmoNets: Multimodal deep learning approaches for emotion recognition in video

The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood-style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which […]

CUDA

Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository
