high performance computing on graphics processing units: hgpu.org

Posts

Apr, 29

Automatic Parallelization: Executing Sequential Programs on a Task-Based Parallel Runtime

Billions of lines of sequential code in today's software do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote efficient use of the available parallelism, has been a research goal for some time. This work proposes a new approach for achieving this goal. […]

OpenCL

Apr, 29

Parallel Subgraph Mining on Hybrid Platforms: HPC Systems, Multi-Cores and GPUs

Frequent subgraph mining (FSM) is an important problem in numerous application areas, such as computational chemistry, bioinformatics, social networks, computer programming languages, etc. However, the problem is computationally hard because it requires enumerating possibly an exponential number of candidate subgraph patterns, and checking their presence in a single large graph or a database of graphs. […]

CUDA

Apr, 29

Adaptive GPU Array Layout Auto-Tuning

Optimal performance is an important goal in compute intensive applications. For GPU applications, this requires a lot of experience and knowledge about the algorithms and the underlying hardware, making them an ideal target for autotuning approaches. We present an auto-tuner which optimizes array layouts in CUDA applications. Depending on the data and program parameters, kernels […]

CUDA

Apr, 29

A Survey of Cache Bypassing Techniques

With increasing core-count, the cache demand of modern processors has also increased. However, due to strict area/power budgets and presence of poor data-locality workloads, blindly scaling cache capacity is both infeasible and ineffective. Cache bypassing is a promising technique to increase effective cache capacity without incurring power/area costs of a larger sized cache. However, injudicious […]

Apr, 26

GPU-Aware Non-contiguous Data Movement In Open MPI

Due to better parallel density and power efficiency, GPUs have become more popular for use in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movements for GPU-resident data […]

CUDA

Apr, 26

Investigating performance portability of a highly scalable particle-in-cell simulation code on various multi-core architectures

The alpaka library defines and implements an abstract hierarchical redundant parallelism model. This model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. This makes it possible to achieve portability of performant codes across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a […]

CUDA

Apr, 26

To Co-Run, or Not To Co-Run: A Performance Study on Integrated Architectures

Architecture designers tend to integrate both CPU and GPU on the same chip to deliver energy-efficient designs. To effectively leverage the power of both CPUs and GPUs on integrated architectures, researchers have recently put substantial efforts into co-running a single application on both the CPU and the GPU of such architectures. However, few studies have […]

OpenCL

Apr, 26

Opt: A Domain Specific Language for Non-linear Least Squares Optimization in Graphics and Imaging

Many graphics and vision problems are naturally expressed as optimizations with either linear or non-linear least squares objective functions over visual data, such as images and meshes. The mathematical descriptions of these functions are extremely concise, but their implementation in real code is tedious, especially when optimized for real-time performance in interactive applications. We propose […]

CUDA

Apr, 26

CMA-ES for Hyperparameter Optimization of Deep Neural Networks

Hyperparameters of deep neural networks are often optimized by grid search, random search or Bayesian optimization. As an alternative, we propose to use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which is known for its state-of-the-art performance in derivative-free optimization. CMA-ES has some useful invariance properties and is friendly to parallel evaluations of solutions. We […]

CUDA

Apr, 24

GPL: A GPU-based Pipelined Query Processing Engine

Graphics Processing Units (GPUs) have evolved as a powerful query co-processor for main memory On-Line Analytical Processing (OLAP) databases. However, existing GPU-based query processors adopt a kernel-based execution approach which optimizes individual kernels for resource utilization and executes the GPU kernels involved in the query plan one by one. Such a kernel-based approach cannot utilize […]

OpenCL

Apr, 22

OpenCL-Based Mobile GPGPU Benchmarking: Methods and Challenges

Benchmarking general-purpose computing on graphics processing units (GPGPU) aims to profile and compare performance across different devices. Due to the low-level nature of most GPGPU APIs, GPGPU benchmarks are also useful for architectural exploration and program optimization. This can be challenging in mobile devices due to the lack of underlying hardware details and limited profiling capabilities […]

OpenCL

Apr, 19

Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems

We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK […]

CUDA, OpenCL

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
