high performance computing on graphics processing units: hgpu.org

Posts

Jun, 16

Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application

Graphics Processing Units (GPUs) are becoming popular accelerators in modern High-Performance Computing (HPC) clusters. Installing GPUs on each node of the cluster is not efficient, resulting in high costs and power consumption as well as underutilisation of the accelerator. The research reported in this paper is motivated towards the use of few physical GPUs by […]

CUDA

Jun, 14

International Conference on Robotics and Machine Vision (ICRMV’16), 2016

Index: Scopus, Ei Compendex, Web of Science (CPCI), Inspec, Google Scholar, Microsoft Academic Search, etc.

AGENDA:
September 14, 2016: Registration & Conference Materials Collection
September 15, 2016: Keynote Speeches & Participants’ Oral Presentation
September 16, 2016: Visit

PUBLICATION: ICRMV 2016 conference Proceedings

CONTACT US: Ms. Janet Hsiao, E-mail: icrmv@academic.net

Jun, 14

International Conference on Cybernetics, Robotics and Control (ICCRC’16), 2016

Publication: All accepted papers of CRC 2016 (Registered & Presented) will be collected in the conference proceedings, which will be indexed by EI and Scopus. Selected papers will be published in International Journal of Mechanical Engineering and Robotics Research (ISSN: 2278-0149), which is indexed by Index Copernicus, Scopus (since 2016), etc. Contact: Ethell Shin E-mail: […]

Jun, 14

Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond

With the appearance of the heterogeneous platform OpenPower, many-core accelerator devices have been coupled with Power host processors for the first time. Towards utilizing their full potential, it is worth investigating performance-portable algorithms that allow one to choose the best-fitting hardware for each domain-specific compute task. Suiting even the high level of parallelism on modern GPGPUs, […]

CUDA

Jun, 14

First Application of Lattice QCD to Pezy-SC Processor

The Pezy-SC processor is a novel architecture developed by Pezy Computing K. K. that has achieved large computational power with low electric power consumption. It works as an accelerator device similarly to GPGPUs. A programming environment that resembles OpenCL is provided. Using a hybrid parallel system "Suiren" installed at KEK, we port and tune a […]

OpenCL

Jun, 14

OpenCL-Based Erasure Coding on Heterogeneous Architectures

Erasure coding, Reed-Solomon coding in particular, is a key technique to deal with failures in scale-out storage systems. However, due to the algorithmic complexity, the performance overhead of erasure coding can become a significant bottleneck in storage systems attempting to meet service level agreements (SLAs). Previous work has mainly leveraged SIMD (single-instruction multiple-data) instruction extensions […]

OpenCL

Jun, 14

Processing Big Data in Main Memory and on GPU

Many large-scale systems were designed with the assumption that I/O is the bottleneck, but this assumption has been challenged in the past decade with new trends in hardware capabilities and workload demands. The computational power of CPU cores has not improved proportional to the performance of disks and network interfaces in the past decade, but […]

CUDA • OpenCL

Jun, 14

Multi-GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL

Modern Graphics Processing Units (GPUs) have become very useful for computing complex and time-consuming processes. GPUs provide high-performance computation capabilities at a good price. This paper deals with multi-GPU OpenCL and CUDA implementations of the k-Nearest Neighbor (k-NN) algorithm. This work compares performances of OpenCL and CUDA implementations where each of them is suitable for […]

CUDA • OpenCL

Jun, 9

Analysis and Parameter Prediction of Compiler Transformation for Graphics Processors

In the last decade graphics processors (GPUs) have been extensively used to solve computationally intensive problems. A variety of GPU architectures by different hardware manufacturers have been shipped in a few years. OpenCL has been introduced as the standard cross-vendor programming framework for GPU computing. Writing and optimising OpenCL applications is a challenging task, the […]

OpenCL

Jun, 9

Decoupled Vector-Fetch Architecture with a Scalarizing Compiler

As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Among the prominent accelerators that have recently become more popular are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs). Surveying a wide […]

CUDA • OpenCL

Jun, 9

OpenMP Parallelization and Optimization of Graph-based Machine Learning Algorithms

We investigate the OpenMP parallelization and optimization of two novel data classification algorithms. The new algorithms are based on graph and PDE solution techniques and provide significant accuracy and performance advantages over traditional data classification algorithms in serial mode. The methods leverage the Nyström extension to calculate eigenvalues/eigenvectors of the graph Laplacian and this is […]

Jun, 9

Adaptive Multi-level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU

Sparse matrix vector multiplication (SpMV) is the dominant kernel in scientific simulations. Many-core processors such as GPUs accelerate SpMV computations with high parallelism and memory bandwidth compared to CPUs; however, even for many-core processors the performance of SpMV is still strongly limited by memory bandwidth, and lower locality of memory access to the input vector causes […]

CUDA
