high performance computing on graphics processing units: hgpu.org

Posts

Dec, 17

Toward Automatic Translation: From OpenACC to OpenMP 4

For the past few years, OpenACC has been the primary directive-based API for programming accelerator devices like GPUs. OpenMP 4.0 is now a competitor in this space, with support from different vendors. In our work, we analyse the feasibility of automatic conversion from OpenACC to OpenMP 4. We describe an algorithm to convert OpenACC device […]

Dec, 17

Speedup for quantum optimal control from GPU-based automatic differentiation

We implement a quantum optimal control algorithm based on automatic differentiation and harness the acceleration afforded by graphics processing units (GPUs). Automatic differentiation allows us to specify advanced optimization criteria and incorporate them in the optimization process with ease. We demonstrate that the use of GPUs can speed up calculations by more than an order […]

CUDA

Dec, 17

Parallel Level set algorithm with MPI and accelerated on GPU

The level set method has been used to capture interface motion. The narrow band algorithm is applied to localize the solution of the level-set PDE on the global domain to a tube around the interface. Because the evolving interface is unknown, the narrow band algorithm introduces a load-balancing problem for parallel computing. This work presents a tool for evenly distributing work […]

CUDA

Dec, 14

Automating the Last-Mile for High Performance Dense Linear Algebra

High performance dense linear algebra (DLA) libraries often rely on a general matrix multiply (Gemm) kernel that is implemented using assembly or with vector intrinsics. In particular, the real-valued Gemm kernels provide the overwhelming fraction of performance for the complex-valued Gemm kernels, along with the entire level-3 BLAS and many of the real and complex […]

Dec, 14

Translating OpenMP Device Constructs to OpenCL using Unnecessary Data Transfer Elimination

In this paper, we propose a framework that translates OpenMP 4.0 accelerator directives to OpenCL. By translating an OpenMP program to an OpenCL program, the program can be executed on any hardware platform that supports OpenCL. We also propose a run-time optimization technique that automatically eliminates unnecessary data transfers between the host and the target […]

OpenCL

Dec, 14

Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation

The most popular multithreaded languages based on the fork-join concurrency model (CilkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, like a GPGPU, for which the CUDA language is a natural […]

CUDA

Dec, 14

nmfgpu4R: GPU-Accelerated Computation of the Non-Negative Matrix Factorization (NMF) Using CUDA Capable Hardware

In this work, a novel package called nmfgpu4R is presented, which offers the computation of Non-negative Matrix Factorization (NMF) on Compute Unified Device Architecture (CUDA) platforms within the R environment. Benchmarks show a remarkable speed-up in terms of time per iteration by utilizing the parallelization capabilities of modern graphics cards. Therefore the application of NMF […]

CUDA

Dec, 14

GaDei: On Scale-up Training As A Service For Deep Learning

Deep learning (DL) training-as-a-service (TaaS) is an important emerging industrial workload. The unique challenge of TaaS is that it must satisfy a wide range of customers who lack the experience and resources to tune DL hyper-parameters, and meticulous tuning for each user’s dataset is prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed with values that […]

CUDA

Dec, 13

5th International Conference on Sustainable Development (ICSD), 2017

The 5th ICSD 2017 will be an excellent opportunity to share your ideas and research findings relevant to Sustainability Science through the European network of academics. Papers will be published in the EJSD Journal (Thomson Reuters) and Proceedings. The European Center of Sustainable Development, in collaboration with CIT University, will organize the 5th ICSD 2017 Rome, […]

Dec, 10

cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs

The Fast Fourier Transform (FFT) is one of the most important numerical tools, widely used in many scientific and engineering applications. The algorithm performs O(n log n) operations on n input data points in order to calculate only a small number, k, of large coefficients, while the remaining n − k coefficients are zero or negligibly small. […]

CUDA

Dec, 10

Implementing and Evaluating Candidate-Based Invariant Generation

The discovery of inductive invariants lies at the heart of static program verification. This paper describes our efforts to apply candidate-based invariant generation in GPUVerify, a static checker of programs that run on GPUs. We study a set of 383 GPU programs that contain loops, drawn from a number of open source suites and vendor […]

CUDA • OpenCL

Dec, 10

Performance Evaluation and Optimization of HPCG benchmark on CPU + MIC platform

High-performance conjugate gradient (HPCG) is the latest benchmark adopted by the TOP500 organization, and how to optimize the HPCG source code for different heterogeneous computing platforms to achieve a higher floating-point computation rate has become a hot topic in the HPC field. In this paper, we used the CPU + MIC heterogeneous computing […]


