high performance computing on graphics processing units: hgpu.org

Posts

Nov, 12

Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II

The High Performance Computing (HPC) community recognizes energy consumption as a major problem. Extensive research is underway to identify means to increase energy efficiency of HPC systems including consideration of alternative building blocks for future systems. This thesis considers one such system, the Texas Instruments Keystone II, a heterogeneous Low-Power System-on-Chip (LPSoC) processor that combines […]

OpenCL

Nov, 12

Performance Evaluation of Deep Learning Tools in Docker Containers

With the success of deep learning techniques in a broad range of application domains, many deep learning software frameworks have been developed and are being updated frequently to adapt to new hardware features and software libraries, which bring a big challenge for end users and system administrators. To address this problem, container techniques are widely […]

CUDA

Nov, 7

Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs

Efficiently exploiting GPUs is increasingly essential in scientific computing, as many current and upcoming supercomputers are built using them. To facilitate this, there are a number of programming approaches, such as CUDA, OpenACC and OpenMP 4, supporting different programming languages (mainly C/C++ and Fortran). There are also several compiler suites (clang, nvcc, PGI, XL) each […]

CUDA

Nov, 7

Radeon PRO Solid State Graphics (SSG) API User Manual

The Radeon Pro SSG software library enables peer-to-peer (P2P) data transfers between GPU and Radeon on board SSD devices. It allows a methodology to read OS file data from SSDs to OpenCL, OpenGL and DirectX buffers with very low-latency P2P communication. The development kit version of this library supports only the Microsoft Windows 10 operating […]

OpenCL

•

OpenGL

Nov, 7

Scalable Streaming Tools for Analyzing N-body Simulations: Finding Halos and Investigating Excursion Sets in One Pass

Cosmological N-body simulations play a vital role in studying how the Universe evolves. To compare to observations and make scientific inference, statistic analysis on large simulation datasets, e.g., finding halos, obtaining multi-point correlation functions, is crucial. However, traditional in-memory methods for these tasks do not scale to the datasets that are forbiddingly large in modern […]

CUDA

Nov, 7

Lattice QCD on new chips: a community summary

I review the most recent evolutions of the QCD codes on new architectures, with a focus on the performances obtained by the different coding strategies as presented during the Lattice-2017 conference.

CUDA

Nov, 7

Acceleration of tensor-product operations for high-order finite element methods

This paper is devoted to GPU kernel optimization and performance analysis of three tensor-product operators arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close-to-the-peak performance for these operators requires extensive optimization because of the operators’ properties: low arithmetic intensity, tiered structure, and the need to store […]

CUDA

Nov, 5

Dynamic Load Balancing Strategies for Graph Applications on GPUs

Acceleration of graph applications on GPUs has found large interest due to the ubiquitous use of graph processing in various domains. The inherent irregularity in graph applications leads to several challenges for parallelization. A key challenge, which we address in this paper, is that of load-imbalance. If the work-assignment to threads uses node-based graph partitioning, […]

CUDA

Nov, 5

A Dynamic Hash Table for the GPU

We design and implement a fully concurrent dynamic hash table for GPUs with comparable performance to the state of the art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing. By using this […]

CUDA

Nov, 5

Data Coherence Analysis and Optimization for Heterogeneous Computing

Although heterogeneous computing has enabled impressive program speed-ups, knowledge about the architecture of the target device is still critical to reap full hardware benefits. Programming such architectures is complex and is usually done by means of specialized languages (e.g. CUDA, OpenCL). The cost of moving and keeping host/device data coherent may easily eliminate any performance […]

OpenCL

Nov, 5

ChainerMN: Scalable Distributed Deep Learning Framework

One of the keys for deep learning to have made a breakthrough in various fields was to utilize high computing powers centering around GPUs. Enabling the use of further computing abilities by distributed processing is essential not only to make the deep learning bigger and faster but also to tackle unsolved challenges. We present the […]

CUDA

Nov, 5

Deep and Shallow convections in Atmosphere Models on Intel Xeon Phi Coprocessor Systems

Deep and shallow convection calculations occupy significant times in atmosphere models. These calculations also present significant load imbalances due to varying cloud covers over different regions of the grid. In this work, we accelerate these calculations on Intel Xeon Phi Coprocessor Systems. By employing dynamic scheduling in OpenMP, we demonstrate large reductions in load imbalance […]

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II

Performance Evaluation of Deep Learning Tools in Docker Containers

Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs

Radeon PRO Solid State Graphics (SSG) API User Manual

Scalable Streaming Tools for Analyzing N-body Simulations: Finding Halos and Investigating Excursion Sets in One Pass

Lattice QCD on new chips: a community summary

Acceleration of tensor-product operations for high-order finite element methods

Dynamic Load Balancing Strategies for Graph Applications on GPUs

A Dynamic Hash Table for the GPU

Data Coherence Analysis and Optimization for Heterogeneous Computing

ChainerMN: Scalable Distributed Deep Learning Framework

Deep and Shallow convections in Atmosphere Models on Intel Xeon Phi Coprocessor Systems

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)