15790

Posts

May, 3

Exposing Errors Related to Weak Memory in GPU Applications

We present the systematic design of a testing environment that uses stressing and fuzzing to reveal errors in GPU applications that arise due to weak memory effects. We evaluate our approach on seven GPUs spanning three Nvidia architectures, across ten CUDA applications that use fine-grained concurrency. Our results show that applications that rarely or never […]
Apr, 29

Array Program Transformation with Loo.py by Example: High-Order Finite Elements

To concisely and effectively demonstrate the capabilities of our program transformation system Loo.py, we examine a transformation path from two real-world Fortran subroutines as found in a weather model to a single high-performance computational kernel suitable for execution on modern GPU hardware. Along the transformation path, we encounter kernel fusion, vectorization, prefetching, parallelization, and algorithmic […]
Apr, 29

On the design of sparse hybrid linear solvers for modern parallel architectures

In the context of this thesis, our focus is on numerical linear algebra, more precisely on solution of large sparse systems of linear equations. We focus on designing efficient parallel implementations of MaPHyS, an hybrid linear solver based on domain decomposition techniques. First we investigate the MPI+threads approach. In MaPHyS, the first level of parallelism […]
Apr, 29

Automatic Parallelization: Executing Sequential Programs on a Task-Based Parallel Runtime

There are billions of lines of sequential code inside nowadays’ software which do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote an efficient use of the available parallelism, has been a research goal for some time now. This work proposes a new approach for achieving such goal. […]
Apr, 29

Adaptive GPU Array Layout Auto-Tuning

Optimal performance is an important goal in compute intensive applications. For GPU applications, this requires a lot of experience and knowledge about the algorithms and the underlying hardware, making them an ideal target for autotuning approaches. We present an auto-tuner which optimizes array layouts in CUDA applications. Depending on the data and program parameters, kernels […]
Apr, 29

Parallel Subgraph Mining on Hybrid Platforms: HPC Systems, Multi-Cores and GPUs

Frequent subgraph mining (FSM) is an important problem in numerous application areas, such as computational chemistry, bioinformatics, social networks, computer programming languages, etc. However, the problem is computationally hard because it requires enumerating possibly an exponential number of candidate subgraph patterns, and checking their presence in a single large graph or a database of graphs. […]
Apr, 29

A Survey of Cache Bypassing Techniques

With increasing core-count, the cache demand of modern processors has also increased. However, due to strict area/power budgets and presence of poor data-locality workloads, blindly scaling cache capacity is both infeasible and ineffective. Cache bypassing is a promising technique to increase effective cache capacity without incurring power/area costs of a larger sized cache. However, injudicious […]
Apr, 26

To Co-Run, or Not To Co-Run: A Performance Study on Integrated Architectures

Architecture designers tend to integrate both CPU and GPU on the same chip to deliver energy-efficient designs. To effectively leverage the power of both CPUs and GPUs on integrated architectures, researchers have recently put substantial efforts into co-running a single application on both the CPU and the GPU of such architectures. However, few studies have […]
Apr, 26

GPU-Aware Non-contiguous Data Movement In Open MPI

Due to better parallel density and power efficiency, GPUs have become more popular for use in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movements for GPU-resident data […]
Apr, 26

Investigating performance portability of a highly scalable particle-in-cell simulation code on various multi-core architectures

The alpaka library defines and implements an abstract hierarchical redundant parallelism model. This model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. This allows to achieve portability of performant codes across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a […]
Apr, 26

Opt: A Domain Specific Language for Non-linear Least Squares Optimization in Graphics and Imaging

Many graphics and vision problems are naturally expressed as optimizations with either linear or non-linear least squares objective functions over visual data, such as images and meshes. The mathematical descriptions of these functions are extremely concise, but their implementation in real code is tedious, especially when optimized for real-time performance in interactive applications. We propose […]
Apr, 26

CMA-ES for Hyperparameter Optimization of Deep Neural Networks

Hyperparameters of deep neural networks are often optimized by grid search, random search or Bayesian optimization. As an alternative, we propose to use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which is known for its state-of-the-art performance in derivative-free optimization. CMA-ES has some useful invariance properties and is friendly to parallel evaluations of solutions. We […]

* * *

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hpgu.org