Posts
Feb, 3
GPGPU and MIC in Accelerated Cluster for Remote Sensed Image Processing Software
Processing Earth-observation remotely sensed images requires increasingly powerful computing facilities. For several years, GPGPU (General Purpose processing on Graphics Processing Units) technology has been used to perform massively parallel calculations. The French Space Agency (CNES) has therefore ported some IAS to assess their performance using this type […]
Feb, 3
On the Accelerating of Two-dimensional Smart Laplacian Smoothing on the GPU
This paper presents a GPU-accelerated implementation of two-dimensional Smart Laplacian smoothing. This implementation is developed following our paradigm for accelerating Laplacian-based mesh smoothing [13]. Two types of commonly used data layouts, Array-of-Structures (AoS) and Structure-of-Arrays (SoA), are used to represent triangular meshes in our implementation. Two iteration forms that have different choices […]
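The AoS-versus-SoA distinction in this abstract can be illustrated with a minimal sketch. The vertex fields and the centroid computation below are hypothetical stand-ins, not the paper's actual mesh data structures:

```python
# AoS: one record per vertex -- the fields of a single vertex sit together.
aos = [{"x": float(i), "y": float(2 * i)} for i in range(4)]

# SoA: one array per field -- the same field of neighbouring vertices is
# contiguous, which is what lets consecutive GPU threads read consecutive
# addresses when each thread handles one vertex.
soa = {"x": [float(i) for i in range(4)],
       "y": [float(2 * i) for i in range(4)]}

def centroid_x_aos(verts):
    # Touches every record, striding over the interleaved y fields.
    return sum(v["x"] for v in verts) / len(verts)

def centroid_x_soa(fields):
    # Streams through one dense array of x values only.
    return sum(fields["x"]) / len(fields["x"])

# Both layouts encode the same data, so both give the same answer.
assert centroid_x_aos(aos) == centroid_x_soa(soa) == 1.5
```

The layouts hold identical data; the choice only changes which elements are adjacent in memory, which is what the paper evaluates on the GPU.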
Feb, 3
Scaling Recurrent Neural Network Language Models
This paper investigates the scaling properties of Recurrent Neural Network Language Models (RNNLMs). We discuss how to train very large RNNs on GPUs and address the questions of how RNNLMs scale with respect to model size, training-set size, computational costs and memory. Our analysis shows that despite being more costly to train, RNNLMs obtain much […]
Feb, 2
Multi-GPU Support on Shared Memory System using Directive-based Programming Model
Existing and emerging studies show that using a single Graphics Processing Unit (GPU) can yield significant performance gains. These devices have tremendous processing capabilities. Further orders of speedup should be achievable by using more than one GPU. Heterogeneous processors consisting of multiple CPUs and GPUs offer immense potential […]
Feb, 2
Characterizing and Enhancing Global Memory Data Coalescing on GPUs
Effective parallel programming for GPUs requires careful attention to several factors, including ensuring coalesced access of data from global memory. There is a need for tools that can provide feedback to users about statements in a GPU kernel where non-coalesced data access occurs, and assistance in fixing the problem. In this paper, we address both […]
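The coalescing behaviour this abstract targets can be sketched by counting how many memory segments one warp's loads touch; the 128-byte segment size and the two access patterns below are illustrative assumptions, not the paper's tool:

```python
def segments_touched(addresses, seg_bytes=128):
    """Count the distinct memory segments a warp's accesses fall into.

    One segment per warp-wide load is the fully coalesced ideal; each extra
    segment is an extra memory transaction.
    """
    return len({addr // seg_bytes for addr in addresses})

WARP = 32   # threads per warp
ELEM = 4    # bytes per float element

# Coalesced: thread t reads element t, so the warp reads 32 consecutive words.
coalesced = [t * ELEM for t in range(WARP)]

# Strided: thread t reads element t * 32 (e.g. a column of a row-major
# matrix), so every thread's access lands in its own segment.
strided = [t * 32 * ELEM for t in range(WARP)]

assert segments_touched(coalesced) == 1    # one 128-byte transaction
assert segments_touched(strided) == 32     # one transaction per thread
```

A feedback tool of the kind the paper describes would flag statements producing the second pattern and suggest a layout or loop transformation that restores the first.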
Feb, 2
Performance Analysis and Optimization of Hermite Methods on NVIDIA GPUs Using CUDA
In this thesis we present the first, to our knowledge, implementation and performance analysis of Hermite methods on GPU accelerated systems. We give analytic background for Hermite methods; give implementations of the Hermite methods on traditional CPU systems as well as on GPUs; give the reader background on basic CUDA programming for GPUs; discuss performance […]
Feb, 2
Reliable Initialization of GPU-enabled Parallel Stochastic Simulations Using Mersenne Twister for Graphics Processors
Parallel stochastic simulations tend to exploit more and more computing power, and they are now also developed for General Purpose Graphics Processing Units (GP-GPUs). Consequently, they need reliable random sources to feed their applications. We propose a survey of the current Pseudo Random Number Generators (PRNGs) available on GPUs. We give a particular focus to […]
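The initialization problem the abstract raises is giving each parallel replicate its own well-separated stream. A minimal sketch of one common parameterization scheme, deriving each stream's seed by hashing a master seed and a stream id, is shown below; it is an illustration of the problem, not the MTGP initialization the paper studies:

```python
import hashlib
import random

def stream_seed(master_seed, stream_id):
    # Hash (master, id) so that nearby stream ids do not get nearby seeds,
    # which naive schemes like seed = master + id would produce.
    digest = hashlib.sha256(f"{master_seed}:{stream_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def make_streams(master_seed, n):
    """One independent generator object per simulation replicate."""
    return [random.Random(stream_seed(master_seed, i)) for i in range(n)]

streams = make_streams(12345, 4)
draws = [s.random() for s in streams]

# The streams start from distinct states, and the whole setup is reproducible.
assert len(set(draws)) == len(draws)
assert [s.random() for s in make_streams(12345, 4)] == draws
```

Hashed per-stream seeding avoids the correlated-stream hazard of incremental seeds, but it offers no statistical guarantee of stream independence; that is exactly the kind of property the surveyed GPU PRNGs are assessed on.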
Feb, 2
Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model
We present a library for parallel block-sparse matrix-matrix multiplication on distributed memory clusters. The library is based on the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. Acting as matrix library developers, using this model we do not have to explicitly deal with distribution of work and data or communication between computational nodes […]
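The block-sparse representation behind such a library can be sketched as a map from block coordinates to small dense blocks; the dict-of-blocks structure and serial loop below are assumptions for illustration only, since the Chunks and Tasks model additionally distributes the blocks and schedules the per-block products across nodes:

```python
def bsp_matmul(A, B, block):
    """Multiply block-sparse matrices stored as {(brow, bcol): dense block}.

    Dense blocks are row-major lists of lists; only nonzero blocks are stored,
    so work is proportional to the number of matching block pairs.
    """
    C = {}
    for (i, k), a in A.items():
        for (k2, j), b in B.items():
            if k != k2:
                continue
            # Dense product of one block pair: prod = a @ b.
            prod = [[sum(a[r][t] * b[t][c] for t in range(block))
                     for c in range(block)] for r in range(block)]
            if (i, j) in C:
                C[(i, j)] = [[C[(i, j)][r][c] + prod[r][c]
                              for c in range(block)] for r in range(block)]
            else:
                C[(i, j)] = prod
    return C

# One stored block times a 2x2 identity block leaves it unchanged.
A = {(0, 0): [[1.0, 2.0], [3.0, 4.0]]}
B = {(0, 0): [[1.0, 0.0], [0.0, 1.0]]}
assert bsp_matmul(A, B, 2) == {(0, 0): [[1.0, 2.0], [3.0, 4.0]]}
```

In the distributed setting the paper describes, each (i, k) × (k, j) block product becomes an independent task, which is what the Chunks and Tasks runtime maps onto cluster nodes without the library author managing communication.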
Feb, 2
Montblanc: GPU accelerated Radio Interferometer Measurement Equations in support of Bayesian Inference for Radio Observations
We present Montblanc, a GPU implementation of the Radio interferometer measurement equation (RIME) in support of the Bayesian inference for radio observations (BIRO) technique. BIRO uses Bayesian inference to select sky models that best match the visibilities observed by a radio interferometer. To accomplish this, BIRO evaluates the RIME multiple times, varying sky model parameters […]
Feb, 1
Optimized Data Transfers Based on the OpenCL Event Management Mechanism
In standard OpenCL programming, hosts such as CPUs are supposed to control their compute devices such as GPUs. Since compute devices are dedicated to kernel computation, only hosts can execute several kinds of data transfers such as inter-node communication and file access. These data transfers require one host to simultaneously play two or more roles […]
Feb, 1
In-Memory Data Analytics on Coupled CPU-GPU Architectures
In the big data era, in-memory data analytics is an effective means of achieving high performance data processing and realizing the value of data in a timely manner. Efforts in this direction have been spent on various aspects, including in-memory algorithmic designs and system optimizations. In this paper, we propose to develop the next-generation in-memory […]
Feb, 1
Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
With the prevalence of GPUs as throughput engines for data-parallel workloads, the landscape of GPU computing is changing significantly. Non-graphics workloads with high memory intensity and irregular access patterns are frequently targeted for acceleration on GPUs. While GPUs provide large numbers of compute resources, the resources needed for memory-intensive workloads are scarcer. […]