Posts
Jan 29
A Detailed GPU Cache Model Based on Reuse Distance Theory
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means […]
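The central quantity in this abstract, the reuse distance of a memory access, can be sketched in a few lines: it is the number of distinct addresses touched since the previous access to the same address, and a fully associative LRU cache of C lines hits exactly when that distance is below C. The trace format and function name below are illustrative assumptions, not the paper's model:

```python
def reuse_distances(trace):
    """For each access, the number of distinct addresses touched
    since the previous access to the same address (inf on first use).
    Minimal O(n^2) sketch; real tools use trees for efficiency."""
    last_seen = {}  # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses accessed strictly between the two uses
            window = trace[last_seen[addr] + 1 : i]
            distances.append(len(set(window)))
        else:
            distances.append(float("inf"))
        last_seen[addr] = i
    return distances

print(reuse_distances(["a", "b", "c", "a", "b"]))  # [inf, inf, inf, 2, 2]
```

Under this model, both reuses above hit in any LRU cache with at least three lines.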
Jan 29
Hybrid algorithms for efficient Cholesky decomposition and matrix inverse using multicore CPUs with GPU accelerators
The use of linear algebra routines is fundamental to many areas of computational science, yet their implementation in software still forms the main computational bottleneck in many widely used algorithms. In machine learning and computational statistics, for example, the use of Gaussian distributions is ubiquitous, and routines for calculating the Cholesky decomposition, matrix inverse and […]
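For reference, the Cholesky decomposition this abstract builds on factors a symmetric positive-definite matrix A into L·Lᵀ with L lower triangular. A textbook pure-Python sketch (illustrative only; the paper's hybrid CPU/GPU routines are far more elaborate):

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L * L^T, for a symmetric
    positive-definite matrix A given as a list of lists."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # subtract contributions of already-computed columns
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)  # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

print(cholesky([[4.0, 2.0], [2.0, 3.0]]))  # [[2.0, 0.0], [1.0, 1.414...]]
```

The triangular solve against L (rather than a full inverse) is what makes this factorization the workhorse for Gaussian-distribution computations.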
Jan 29
Consolidating Applications for Energy Efficiency in Heterogeneous Computing Systems
By scheduling multiple applications with complementary resource requirements on a smaller number of compute nodes, we aim to improve performance, resource utilization, energy consumption, and energy efficiency simultaneously. In addition to our naive consolidation approach, which already achieves the aforementioned goals, we propose a new energy efficiency-aware (EEA) scheduling policy and compare its performance with […]
Jan 29
Wideband Channelization for Software-Defined Radio via Mobile Graphics Processors
Wideband channelization is a computationally intensive task within software-defined radio (SDR). To support this task, the underlying hardware should provide high performance and allow flexible implementations. Traditional solutions use field-programmable gate arrays (FPGAs) to satisfy these requirements. While FPGAs allow for flexible implementations, realizing an FPGA implementation is a difficult and time-consuming process. On the […]
Jan 29
On the Programmability and Performance of Heterogeneous Platforms
General-purpose computing on an ever-broadening array of parallel devices has led to an increasingly complex and multi-dimensional landscape with respect to programmability and performance optimization. The growing diversity of parallel architectures presents many challenges to the domain scientist, including device selection, programming model, and level of investment in optimization. All of these choices influence the […]
Jan 29
A Performance Criteria for parallel Computation on basis of block size using CUDA Architecture
A GPU based on the CUDA architecture developed by NVIDIA is a high-performance computing device. Multiplication of matrices of large order can be computed in a few seconds on such a GPU. A modern GPU consists of 16 highly threaded streaming multiprocessors (SMs); the Fermi GPU consists of 32 SMs. These are compute-intensive devices. […]
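The block size this abstract studies corresponds to the tile a CUDA thread block stages through shared memory. The blocking idea itself can be shown in a pure-Python sketch (illustrative only; names and the `block` parameter are assumptions, and a real CUDA kernel maps tiles to thread blocks instead of loops):

```python
def blocked_matmul(A, B, block=2):
    """C = A * B computed tile by tile. The block size plays the
    role of the CUDA thread-block tile held in shared memory."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, p, block):
            for kk in range(0, m, block):
                # multiply one (block x block) tile pair into C
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, p)):
                        acc = 0.0
                        for k in range(kk, min(kk + block, m)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] += acc
    return C

print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], block=1))
# [[19.0, 22.0], [43.0, 50.0]]
```

On a GPU the block size trades off shared-memory reuse against occupancy, which is why it is a natural tuning criterion.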
Jan 29
Impact of communication times on mixed CPU/GPU applications scheduling using KAAPI
High Performance Computing machines make increasing use of Graphics Processing Units, as they are very efficient for homogeneous computations such as matrix operations. However, before using these accelerators, one must transfer data from the processor to them, and such transfers can be slow. In this report, our aim is to study the impact of […]
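A first-order way to reason about the transfer cost discussed above is a break-even model: offloading pays off only when transfer time plus kernel time beats the CPU time. All names and parameters below are illustrative assumptions, not KAAPI's scheduler model:

```python
def offload_worthwhile(bytes_moved, bandwidth_bps, gpu_time_s, cpu_time_s):
    """True when moving the data and running on the GPU is
    faster than computing in place on the CPU."""
    transfer_s = bytes_moved / bandwidth_bps
    return transfer_s + gpu_time_s < cpu_time_s

# 1 GB over a 10 GB/s link costs 0.1 s of pure transfer:
print(offload_worthwhile(1e9, 1e10, gpu_time_s=0.01, cpu_time_s=0.2))  # True
print(offload_worthwhile(1e9, 1e10, gpu_time_s=0.15, cpu_time_s=0.2))  # False
```

A scheduler that ignores the transfer term can therefore offload tasks that run slower end to end, which is precisely the effect the report measures.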
Jan 28
Scheduling on Manycore and Heterogeneous Graphics Processors
Through custom software schedulers that distribute work differently than built-in hardware schedulers, data-parallel and heterogeneous architectures can be retargeted towards irregular task-parallel graphics workloads. This dissertation examines the role of a GPU scheduler and how it may schedule complicated workloads onto the GPU for efficient parallel processing. This dissertation examines the scheduler through three different […]
Jan 28
Automatic Resource-Constrained Static Task Parallelization
This thesis intends to show how to efficiently exploit the parallelism present in applications in order to enjoy the performance benefits that multiprocessors can provide, using a new automatic task parallelization methodology for compilers. The key characteristics we focus on are resource constraints and static scheduling. This methodology includes the techniques required to decompose applications […]
Jan 28
GPU-Qin: A Methodology for Evaluating the Error Resilience of GPGPU Applications
While graphics processing units (GPUs) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging due to their massive parallelism, which […]
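Fault injection of the kind this abstract describes is commonly modeled as a single bit flip in an architectural value. A minimal sketch of that fault model (names are illustrative, not GPU-Qin's actual injector API):

```python
import random

def flip_bit(value, bit=None, width=32):
    """Single-bit-flip fault model: XOR one bit of a width-bit
    integer value, at a random position if none is given."""
    if bit is None:
        bit = random.randrange(width)
    return (value ^ (1 << bit)) & ((1 << width) - 1)

print(flip_bit(0, bit=3))   # 8
print(flip_bit(8, bit=3))   # 0  (flipping twice restores the value)
```

A GPGPU injector must additionally choose which of thousands of threads, and which dynamic instruction, receives the flip, which is where the massive parallelism makes the engineering hard.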
Jan 28
Performance-Correctness Challenges in Emerging Heterogeneous Multicore Processors
We are witnessing a tremendous amount of change in the design of the modern microprocessor. With dozens of CPU cores on-chip in recent multicore processors, the search for thread-level parallelism (TLP) is more significant than ever. In parallel, a very different processor architecture has emerged that aims to extract parallelism at an entirely different scale. Originally […]
Jan 28
Autotuning Programs with Algorithmic Choice
The process of optimizing programs and libraries, both for performance and quality of service, can be viewed as a search problem over the space of implementation choices. This search is traditionally manually conducted by the programmer and often must be repeated when systems, tools, or requirements change. The overriding goal of this work is to […]