high performance computing on graphics processing units: hgpu.org

Posts

Jun, 28

GPU-based matrix-free finite element solver exploiting symmetry of elemental matrices

Matrix-free solvers for finite element method (FEM) avoid assembly of elemental matrices and replace sparse matrix-vector multiplication required in iterative solution method by an element level dense matrix-vector product. In this paper, a novel matrix-free strategy for FEM is proposed which computes element level matrix-vector product by using only the symmetric part of the elemental […]

CUDA

Jun, 28

Sparse GPU Kernels for Deep Learning

Scientific workloads have traditionally exploited high levels of sparsity to accelerate computation and reduce memory requirements. While deep neural networks can be made sparse, achieving practical speedups on GPUs is difficult because these applications have relatively moderate levels of sparsity that are not sufficient for existing sparse kernels to outperform their dense counterparts. In this […]

CUDA

Jun, 28

Preparing Ginkgo for AMD GPUs – A Testimonial on Porting CUDA Code to HIP

With AMD reinforcing their ambition in the scientific high performance computing ecosystem, we extend the hardware scope of the Ginkgo linear algebra package to feature a HIP backend for AMD GPUs. In this paper, we report and discuss the porting effort from CUDA, the extension of the HIP framework to add missing features such as […]

CUDA

Jun, 28

Automatic Kernel Generation for Volta Tensor Cores

A commonly occurring computation idiom in neural networks is to perform some pointwise operations on the result of a matrix multiplication. Such a sequence of operations is typically represented as a computation graph in deep learning compilers. When compiling to a GPU target, these computations can be individually mapped to manually tuned implementations provided by […]

CUDA

Jun, 21

Autotuning for Automatic Parallelization on Heterogeneous Systems

To meet the surging demand for high-speed computation in an era of stagnating increase in performance per processor, systems designers resort to aggregating many and even heterogeneous processors into single systems. Automatic parallelization tools relieve application developers of the tedious and error prone task of programming these heterogeneous systems. For these tools, there are two […]

CUDA

Jun, 21

FPGA Based Satisfiability Checking

The Boolean satisfiability problem, abbreviated as SAT, is the backbone of many applications in VLSI design automation and verification. Over the years, many SAT solvers, both complete and incomplete, have been developed. Complete solvers are usually based on the DPLL (Davis–Putnam–Logemann–Loveland) algorithm, which is a backtracking algorithm. Industrial strength problems are very large and make […]

OpenCL

Jun, 21

Heterogeneous Parallelization and Acceleration of Molecular Dynamics Simulations in GROMACS

The introduction of accelerator devices such as graphics processing units (GPUs) has had profound impact on molecular dynamics simulations and has enabled order-of-magnitude performance advances using commodity hardware. To fully reap these benefits, it has been necessary to reformulate some of the most fundamental algorithms, including the Verlet list, pair searching and cut-offs. Here, we […]

CUDA

•

OpenCL

Jun, 21

Unsupervised Deep Learning of Incompressible Fluid Dynamics

Fast and stable fluid simulations are an essential prerequisite for applications ranging from computer aided aerodynamic design of automobiles or airplanes to simulations of physical effects in CGI to research in meteorology. Recent differentiable fluid simulations allow gradient based methods to optimize e.g. fluid control systems in an informed manner. Solving the partial differential equations […]

Jun, 21

Ansor: Generating High-Performance Tensor Programs for Deep Learning

High-performance tensor programs are crucial to guarantee efficient execution of deep learning models. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously difficult. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering efforts in developing […]

CUDA

Jun, 14

The Rodinia Benchmark Suite in SYCL

We apply the SYCL programming model to the Rodinia benchmark suite, describe the transformations from the OpenCL implementations to the SYCL implementations, and evaluate the benchmarks on microprocessors with a CPU and an integrated GPU. The publicly available implementations of the benchmark suite will track the development of the SYCL compilers, and provide programs for […]

OpenCL

Jun, 14

A Compiler Infrastructure for Embedded Multicore SoCs

Compilers play a pivotal role in the software development process for microprocessors, by automatically translating high-level programming languages into machinespecific executable code. For a long time, while processors were scalar, compilers were regarded as a black box among the software community, due to their successful internal encapsulation of machine-specific details. Over a decade ago, major […]

OpenGL

Jun, 14

Software Testing – Test Suite Compilation and Execution Optimizations

The requirements and responsibilities assumed by software have increasingly rendered it to be large and complex. Testing to ensure that software meets all its requirements and is free from failures is a difficult and time-consuming task that necessitates the use of large test suites, containing many test cases. Time needed to compile and execute large […]

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

high performance computing on graphics processing units: hgpu.org

Posts

GPU-based matrix-free finite element solver exploiting symmetry of elemental matrices

Sparse GPU Kernels for Deep Learning

Preparing Ginkgo for AMD GPUs – A Testimonial on Porting CUDA Code to HIP

Automatic Kernel Generation for Volta Tensor Cores

Autotuning for Automatic Parallelization on Heterogeneous Systems

FPGA Based Satisfiability Checking

Heterogeneous Parallelization and Acceleration of Molecular Dynamics Simulations in GROMACS

Unsupervised Deep Learning of Incompressible Fluid Dynamics

Ansor: Generating High-Performance Tensor Programs for Deep Learning

The Rodinia Benchmark Suite in SYCL

A Compiler Infrastructure for Embedded Multicore SoCs

Software Testing – Test Suite Compilation and Execution Optimizations

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)