high performance computing on graphics processing units: hgpu.org

Posts

Aug, 28

Massively parallel simulations of relativistic fluid dynamics on graphics processing units with CUDA

Relativistic fluid dynamics is a major component in dynamical simulations of the quark-gluon plasma created in relativistic heavy-ion collisions. Simulations of the full three-dimensional dissipative dynamics of the quark-gluon plasma with fluctuating initial conditions are computationally expensive and typically require some degree of parallelization. In this paper, we present a GPU implementation of the Kurganov-Tadmor […]

CUDA

Aug, 23

Fast Multidimensional Image Processing with OpenCL

Multidimensional image data, i.e., images with three or more dimensions, are used in many areas of science. Multidimensional image processing is supported in Python and MATLAB. VisionGL is an open source library that provides a set of image processing functions and can help the programmer by automatically generating code. The objective of this work is […]

CUDA • OpenCL

Aug, 23

Accelerating Exact and Approximate Inference for (Distributed) Discrete Optimization with GPUs

Discrete optimization is a central problem in artificial intelligence. The optimization of the aggregated cost of a network of cost functions arises in a variety of problems including (W)CSP, DCOP, as well as optimization in stochastic variants such as Bayesian networks. Inference-based algorithms are powerful techniques for solving discrete optimization problems, which can be used […]

Aug, 23

MetaMorph: A Library Framework for Interoperable Kernels on Multi- and Many-core Clusters

To attain scalable performance efficiently, the HPC community expects future exascale systems to consist of multiple nodes, each with different types of hardware accelerators. In addition to GPUs and Intel MICs, other candidate accelerators include embedded multiprocessors and FPGAs. End users need appropriate tools to efficiently use the available compute resources in such systems, both […]

CUDA • OpenCL

Aug, 23

MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs

A particularly challenging class of problems arising in many applications, called batched problems, involves linear algebra operations on many small matrices. We propose and design batched BLAS (Basic Linear Algebra Subroutines) routines, Level-2 GEMV and Level-3 GEMM, to solve them. We illustrate how to optimize batched GEMV and GEMM to assist batched advanced factorization (e.g. bi-diagonalization) […]

CUDA

Aug, 23

Hybrid CPU-GPU Framework for Network Motifs

Massively parallel architectures such as the GPU are becoming increasingly important due to the recent proliferation of data. In this paper, we propose a key class of hybrid parallel graphlet algorithms that leverages multiple CPUs and GPUs simultaneously for computing k-vertex induced subgraph statistics (called graphlets). In addition to the hybrid multi-core CPU-GPU framework, we […]

Aug, 18

Streaming Applications on Heterogeneous Platforms

Using multiple streams can improve overall system performance by mitigating the data transfer overhead on heterogeneous systems. Currently, very few applications have been streamed to demonstrate the performance impact of streaming, and a systematic investigation of when streaming is necessary and how to apply it across a large number of test cases is still missing. In this paper, we use […]

Aug, 18

GPU-Acceleration of In-Memory Data Analytics

Hardware advances strongly influence database system design. As the speed of CPU cores flattens, many-core accelerators such as GPUs become a vital alternative to explore for processing ever-increasing amounts of data. GPUs have a significantly higher degree of parallelism than multi-core CPUs, but their cores are simpler. As a result, they do not face […]

CUDA

Aug, 18

GPU-accelerated Gibbs Sampling

Gibbs sampling is a widely used Markov Chain Monte Carlo (MCMC) method for numerically approximating integrals of interest in Bayesian statistics and other mathematical sciences. Many implementations of MCMC methods do not extend easily to parallel computing environments, as their inherently sequential nature incurs a large synchronization cost. In this paper, we show how to […]

CUDA

Aug, 18

SkePU 2: Language Embedding and Compiler Support for Flexible and Type-Safe Skeleton Programming

This thesis presents SkePU 2, the next generation of the SkePU C++ framework for programming of heterogeneous parallel systems using the skeleton programming concept. SkePU 2 is presented after a thorough study of the state of parallel programming models, frameworks and tools, including other skeleton programming systems. The advancements in SkePU 2 include a modern […]

CUDA • OpenCL

Aug, 18

Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising

Discriminative model learning for image denoising has recently attracted considerable attention due to its favorable denoising performance. In this paper, we take one step forward by investigating the construction of feed-forward denoising convolutional neural networks (DnCNNs) to bring the progress in very deep architectures, learning algorithms, and regularization methods to image denoising. Specifically, residual […]

CUDA

Aug, 16

Automatic Generation of OpenCL Code for ARM Architectures

Efficiently exploiting the increasing computational capabilities of mobile devices is still a challenge. The heterogeneity of Systems on Chip (SoCs) demands very specific knowledge of their hardware in order to harness their full potential. OpenCL is a well-known standard for cross-platform usage of accelerator devices. We follow an annotation-based approach […]

OpenCL

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?
