high performance computing on graphics processing units: hgpu.org

Posts

Nov, 12

Vectorized algorithm for multidimensional Monte Carlo integration on modern GPU, CPU and MIC architectures

The aim of this paper is to show that multidimensional Monte Carlo integration can be efficiently implemented on computers with modern multicore CPUs and manycore accelerators, including Intel MIC and GPU architectures, using a new vectorized version of the LCG pseudorandom number generator, which requires a limited amount of memory. We introduce two new implementations of […]

Nov, 12

Scalable and massively parallel Monte Carlo photon transport simulations for heterogeneous computing platforms

We present a highly scalable Monte Carlo (MC) 3D photon transport simulation platform designed for heterogeneous computing systems. By developing a massively parallel MC algorithm using the OpenCL framework, this research extends our existing GPU-accelerated MC technique to a highly-scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel […]

OpenCL

Nov, 12

Best Practice Guide – GPGPU

Graphics Processing Units (GPUs) were originally developed for computer gaming and other graphical tasks, but for many years have been exploited for general purpose computing across a number of areas. They offer advantages over traditional CPUs because they have greater computational capability, and use high-bandwidth memory systems (where memory bandwidth is the main bottleneck for […]

CUDA • OpenCL

Nov, 12

Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II

The High Performance Computing (HPC) community recognizes energy consumption as a major problem. Extensive research is underway to identify means to increase energy efficiency of HPC systems including consideration of alternative building blocks for future systems. This thesis considers one such system, the Texas Instruments Keystone II, a heterogeneous Low-Power System-on-Chip (LPSoC) processor that combines […]

OpenCL

Nov, 12

Performance Evaluation of Deep Learning Tools in Docker Containers

With the success of deep learning techniques in a broad range of application domains, many deep learning software frameworks have been developed and are being updated frequently to adapt to new hardware features and software libraries, which poses a big challenge for end users and system administrators. To address this problem, container techniques are widely […]

CUDA

Nov, 7

Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs

Efficiently exploiting GPUs is increasingly essential in scientific computing, as many current and upcoming supercomputers are built using them. To facilitate this, there are a number of programming approaches, such as CUDA, OpenACC and OpenMP 4, supporting different programming languages (mainly C/C++ and Fortran). There are also several compiler suites (clang, nvcc, PGI, XL) each […]

CUDA

Nov, 7

Radeon PRO Solid State Graphics (SSG) API User Manual

The Radeon Pro SSG software library enables peer-to-peer (P2P) data transfers between the GPU and Radeon on-board SSD devices. It provides a way to read OS file data from SSDs into OpenCL, OpenGL and DirectX buffers with very low-latency P2P communication. The development kit version of this library supports only the Microsoft Windows 10 operating […]

OpenCL • OpenGL

Nov, 7

Scalable Streaming Tools for Analyzing N-body Simulations: Finding Halos and Investigating Excursion Sets in One Pass

Cosmological N-body simulations play a vital role in studying how the Universe evolves. To compare to observations and make scientific inferences, statistical analysis of large simulation datasets, e.g., finding halos and obtaining multi-point correlation functions, is crucial. However, traditional in-memory methods for these tasks do not scale to the datasets that are forbiddingly large in modern […]

CUDA

Nov, 7

Lattice QCD on new chips: a community summary

I review the most recent evolution of QCD codes on new architectures, with a focus on the performance obtained by the different coding strategies as presented during the Lattice-2017 conference.

CUDA

Nov, 7

Acceleration of tensor-product operations for high-order finite element methods

This paper is devoted to GPU kernel optimization and performance analysis of three tensor-product operators arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close-to-the-peak performance for these operators requires extensive optimization because of the operators’ properties: low arithmetic intensity, tiered structure, and the need to store […]

CUDA

Nov, 5

Dynamic Load Balancing Strategies for Graph Applications on GPUs

Acceleration of graph applications on GPUs has attracted considerable interest due to the ubiquitous use of graph processing in various domains. The inherent irregularity in graph applications leads to several challenges for parallelization. A key challenge, which we address in this paper, is that of load imbalance. If the work assignment to threads uses node-based graph partitioning, […]

CUDA

Nov, 5

A Dynamic Hash Table for the GPU

We design and implement a fully concurrent dynamic hash table for GPUs with performance comparable to state-of-the-art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing. By using this […]

CUDA

* * *

Recent source codes

Specx: Speculative task-based runtime system

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

KISim: Kubernetes Intelligent Scheduling Simulator

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler
