Posts

Mar, 26

DSDP: A Blind Docking Strategy Accelerated by GPUs

Virtual screening, including molecular docking, plays an essential role in drug discovery. Many traditional and machine-learning-based methods are available to fulfil the docking task. Traditional docking methods are normally very time-consuming, and their performance in blind docking remains to be improved. Although the runtime of docking based on machine learning is significantly decreased, […]
Mar, 26

Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications

Graphics Processing Units (GPUs) have become ubiquitous in scientific computing. However, writing efficient GPU kernels can be challenging due to the need for careful code tuning. To automatically explore the kernel optimization space, several auto-tuning tools – like Kernel Tuner – have been proposed. Unfortunately, these existing auto-tuning tools often do not concern themselves with […]
Mar, 19

Challenges and Opportunities in C/C++ Source-To-Source Compilation

The C/C++ compilation stack (Intermediate Representations (IRs), compilation passes and backends) is encumbered by a steep learning curve, which we believe can be lowered by complementing it with approaches such as source-to-source compilation. Source-to-source compilation is a technology that is widely used and quite mature in certain programming environments, such as JavaScript, but that faces […]
Mar, 19

Stellar Mergers with HPX-Kokkos and SYCL: Methods of using an Asynchronous Many-Task Runtime System with SYCL

Ranging from NVIDIA GPUs to AMD GPUs and Intel GPUs: Given the heterogeneity of available accelerator cards within current supercomputers, portability is a key aspect for modern HPC applications. In Octo-Tiger, we rely on Kokkos and its various execution spaces for portable compute kernels. In turn, we use HPX to coordinate kernel launches, CPU tasks, […]
Mar, 19

Statistical Computing With Graphics Processing Units

This thesis consists of two main projects and a third project, which is provided in the appendix. The contribution of the first project is a tool set for parallel random number generation on GPUs in R, namely the clrng package. This package is currently the only R package that provides facilities for generating random […]
Mar, 19

Towards a Benchmarking Suite for Kernel Tuners

As computing systems become more complex, it is becoming harder for programmers to keep their codes optimized as the hardware gets updated. Autotuners try to alleviate this by hiding as many architecture-based optimization details as possible from the user, so that the code can be used efficiently across different generations of systems. In this article […]
Mar, 19

Machine Learning-Driven Adaptive OpenMP For Portable Performance on Heterogeneous Systems

Heterogeneity has become a mainstream architecture design choice for building High Performance Computing systems. However, heterogeneity poses significant challenges for achieving performance portability of execution. Adapting a program to a new heterogeneous platform is laborious and requires developers to manually explore a vast space of execution parameters. To address those challenges, this paper proposes new […]
Mar, 12

ARK: GPU-driven Code Execution for Distributed Deep Learning

Modern state-of-the-art deep learning (DL) applications tend to scale out to a large number of parallel GPUs. Unfortunately, we observe that the collective communication overhead across GPUs is often the key limiting factor of performance for distributed DL. It under-utilizes the networking bandwidth by frequent transfers of small data chunks, which also incurs a substantial […]
Mar, 12

BenchDirect: A Directed Language Model for Compiler Benchmarks

The exponential increase of hardware-software complexity has made it impossible for compiler engineers to find the right optimization heuristics manually. Predictive models have been shown to find near optimal heuristics with little human effort but they are limited by a severe lack of diverse benchmarks to train on. Generative AI has been used by researchers […]
Mar, 12

A Deep Learning Model for Loop Interchange

Loop interchange is an important code optimization that improves data locality and extracts parallelism. While previous research in compilers has tried to automate the selection of which loops to interchange, existing methods have an important limitation. They use less precise machine models. This is mainly because developing a model to predict whether to interchange two […]
Mar, 12

Bridging Control-Centric and Data-Centric Optimization

With the rise of specialized hardware and new programming languages, code optimization has shifted its focus towards promoting data locality. Most production-grade compilers adopt a control-centric mindset – instruction-driven optimization augmented with scalar-based dataflow – whereas other approaches provide domain-specific and general purpose data movement minimization, which can miss important control-flow optimizations. As the two […]
Mar, 12

Runtime Support for Performance Portability on Heterogeneous Distributed Platforms

Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. This work introduces […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
