high performance computing on graphics processing units: hgpu.org

Posts

Feb, 6

Comparison of OpenCL performance on different platforms using VexCL and Blaze

This technical report provides performance numbers for several benchmark problems running on several different hardware platforms. The goal of this report is twofold. First, it helps us better understand how the performance of OpenCL changes on different platforms. Second, it provides an OpenCL-OpenMP comparison for a sparse matrix-vector multiplication operation. The VexCL library will be […]

OpenCL

Feb, 6

Register-leaning kernels in CUDA

Kepler cards offer a large amount of register space. One can use this memory to store working data arrays, just as one uses shared memory. This white paper describes this register-leaning approach in detail.

CUDA

Feb, 3

A Survey of Power Management Techniques for Phase Change Memory

The demand for larger memory capacity in high-performance computing systems has motivated researchers to explore alternatives to DRAM (dynamic random access memory). Since PCM (phase change memory) provides high density, good scalability, and non-volatile data storage, it has received a significant amount of attention in recent years. A crucial bottleneck in widespread adoption of PCM, however, […]

Feb, 3

Pointer Analysis for Semi-Automatic Code Parallelizers

Code parallelizers are employed these days to reduce the effort needed to manually parallelize sequential code. But they are ineffective when it comes to handling programming constructs like pointers. Code parallelizers like Par4all have limited support for pointers, while approaches like ASET + BONES cannot handle pointers at all. In this thesis we […]

CUDA • OpenCL

Feb, 3

Exploiting Concurrency Patterns with Heterogeneous Task and Data Parallelism

Parallel programming of an application requires not only domain knowledge of the application, but also programming environment support and in-depth awareness of the target architecture. Often, not all concurrency features of the architecture are exposed to the programming environment. The challenge lies in efficient utilization of these unexposed features to write effective parallel programs. In […]

OpenCL

Feb, 3

GPGPU and MIC in Accelerated Cluster for Remote Sensed Image Processing Software

Processing of remotely sensed Earth observation images requires more and more powerful computing facilities. For several years, GPGPU (General Purpose processing on Graphics Processing Units) technology has been used to perform massively parallel calculations. The French Space Agency (CNES) has therefore ported some IAS to assess their performance using this type […]

CUDA • OpenCL

Feb, 3

On the Accelerating of Two-dimensional Smart Laplacian Smoothing on the GPU

This paper presents a GPU-accelerated implementation of two-dimensional Smart Laplacian smoothing. This implementation is developed under the guideline of our paradigm for accelerating Laplacian-based mesh smoothing [13]. Two commonly used data layouts, Array-of-Structures (AoS) and Structure-of-Arrays (SoA), are used to represent triangular meshes in our implementation. Two iteration forms that have different choices […]

CUDA

Feb, 3

Scaling Recurrent Neural Network Language Models

This paper investigates the scaling properties of Recurrent Neural Network Language Models (RNNLMs). We discuss how to train very large RNNs on GPUs and address the questions of how RNNLMs scale with respect to model size, training-set size, computational costs and memory. Our analysis shows that despite being more costly to train, RNNLMs obtain much […]

CUDA

Feb, 2

Multi-GPU Support on Shared Memory System using Directive-based Programming Model

Existing and emerging studies show that using a single Graphics Processing Unit (GPU) can yield significant performance gains. These devices have tremendous processing capabilities. We should be able to achieve further performance speedup if we use more than just one GPU. Heterogeneous processors consisting of multiple CPUs and GPUs offer immense potential […]

CUDA

Feb, 2

Characterizing and Enhancing Global Memory Data Coalescing on GPUs

Effective parallel programming for GPUs requires careful attention to several factors, including ensuring coalesced access of data from global memory. There is a need for tools that can provide feedback to users about statements in a GPU kernel where non-coalesced data access occurs, and assistance in fixing the problem. In this paper, we address both […]

CUDA

Feb, 2

Performance Analysis and Optimization of Hermite Methods on NVIDIA GPUs Using CUDA

In this thesis we present the first, to our knowledge, implementation and performance analysis of Hermite methods on GPU accelerated systems. We give analytic background for Hermite methods; give implementations of the Hermite methods on traditional CPU systems as well as on GPUs; give the reader background on basic CUDA programming for GPUs; discuss performance […]

CUDA

Feb, 2

Reliable Initialization of GPU-enabled Parallel Stochastic Simulations Using Mersenne Twister for Graphics Processors

Parallel stochastic simulations tend to exploit more and more computing power, and they are now also developed for General Purpose Graphics Processing Units (GP-GPUs). Consequently, they need reliable random sources to feed their applications. We propose a survey of the current Pseudo-Random Number Generators (PRNGs) available on GPUs. We give a particular focus to […]

CUDA

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
