high performance computing on graphics processing units: hgpu.org

Posts

Apr, 2

GPU Programming Strategies and Trends in GPU Computing

Over the last decade, there has been a growing interest in the use of graphics processing units (GPUs) for nongraphics applications. From early academic proof-of-concept papers around the year 2000, the use of GPUs has now matured to a point where there are countless industrial applications. Together with the expanding use of GPUs, we have […]

Mar, 31

Nested Data-Parallelism on the GPU

Graphics processing units (GPUs) provide both memory bandwidth and arithmetic performance far greater than that available on CPUs, but, because of their Single-Instruction-Multiple-Data (SIMD) architecture, they are hard to program. Most of the programs ported to GPUs thus far use traditional data-level parallelism, performing only operations that operate uniformly over vectors. Porting algorithms that do […]

CUDA

Mar, 31

Distributed Password Cracking Platform

This project originates from the need for distribution when performing security testing-related password hash cracking. KPMG IT Advisory uses an MPI-supported John the Ripper cluster plus a separate system with several graphics cards for the cracking of password hashes. As they want to expand their operations, they wish to integrate GPU-capable machines with the current […]

Mar, 31

GHOST: GPGPU-Offloaded High Performance Storage I/O Deduplication for Primary Storage System

Data deduplication has been an effective way to eliminate redundant data mainly for backup storage systems. Since the recent primary storage systems in cloud services are expected to have the redundancy, the deduplication technique can also bring significant cost saving for the primary storage. However, the primary storage system requires high performance requirement about several […]

Mar, 31

Multi-GPU parallelization of a 3D Bayesian CT algorithm and its application on real foam reconstruction with incomplete data set

A great number of image reconstruction algorithms, based on analytical filtered backprojection, are implemented for X-ray Computed Tomography (CT) [1,2]. The limits of these methods appear when the number of projections is small, and/or not equidistributed around the object. That’s the case in the context of dynamic study of fluids in foams for example, the […]

CUDA

Mar, 31

GPGPU-Accelerated Instruction Accurate and Fast Simulation of Thousand-core Platforms

Future architectures will feature hundreds to thousands of simple processors and on-chip memories connected through a network-on-chip. Architectural simulators will remain primary tools for design space exploration, performance (and power) evaluation of these massively parallel architectures. However, architectural simulation performance is a serious concern, as virtual platforms and simulation technology are not able to tackle […]

CUDA

Mar, 31

Adaptive Input-aware Compilation for Graphics Engines

While graphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, the tedious process of performance tuning required to optimize applications is an obstacle to wider adoption of GPUs. In addition to the programmability challenges posed by GPU’s complex memory hierarchy and parallelism model, a well-known application design problem is target portability across […]

CUDA

Mar, 30

A Highly Parallel Reuse Distance Analysis Algorithm on GPUs

Reuse distance analysis is a runtime approach that has been widely used to accurately model the memory system behavior of applications. However, traditional reuse distance analysis algorithms use tree-based data structures and are hard to parallelize, missing the tremendous computing power of modern architectures such as the emerging GPUs. This paper presents a highly-parallel reuse […]

CUDA

Mar, 30

Optimized Strategies for Mapping Three-dimensional FFTs onto CUDA GPUs

We address in this paper the problem of mapping three-dimensional Fast Fourier Transforms (FFTs) onto the recent, highly multithreaded CUDA Graphics Processing Units (GPUs) and present some of the fastest known algorithms for a wide range of 3-D FFTs on the NVIDIA Tesla and Fermi architectures. We exploit the high degree of multi-threading offered by the […]

CUDA

Mar, 30

Performance evaluation of GPU memory hierarchy using the FFT

Modern GPUs (Graphics Processing Units) are becoming more relevant in the world of HPC (High Performance Computing) thanks to their large computing power and relatively low cost; however, their special architecture results in more complex programming. To take advantage of their computing resources and develop efficient implementations, it is essential to have certain knowledge about the […]

CUDA

Mar, 30

A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications

Tuning code for GPGPU and other emerging many-core platforms is a challenge because few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this paper, we present a performance analysis framework that can help shed light on such bottlenecks for GPGPU applications. Although a handful of GPGPU profiling tools exist, most […]

CUDA

Mar, 29

Scheduling Tasks over Multicore machines enhanced with Accelerators: a Runtime System’s Perspective

Multicore machines equipped with accelerators are becoming increasingly popular in the High Performance Computing ecosystem. Hybrid architectures provide significantly improved energy efficiency, so that they are likely to generalize in the Manycore era. However, the complexity introduced by these architectures has a direct impact on programmability, so that it is crucial to provide portable abstractions […]

CUDA
