Posts

Nov, 5

Applying the Midas Touch of Reproducibility to High-Performance Computing

With the serial performance of CPUs improving exponentially through the 1980s and 1990s and then plateauing by the mid-2000s, the high-performance computing community has seen parallel computing become ubiquitous, which, in turn, has led to a proliferation of parallel programming models. This diversity in hardware platform and programming model has forced programmers to port their […]
Nov, 5

Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs

The recent progress of AI can be largely attributed to large language models (LLMs). However, their escalating memory requirements introduce challenges for machine learning (ML) researchers and engineers. Addressing this requires developers to partition a large model to distribute it across multiple GPUs or TPUs. This necessitates considerable coding and intricate configuration efforts with existing […]
Nov, 5

A Comparison of the Performance of the Molecular Dynamics Simulation Package GROMACS Implemented in the SYCL and CUDA Programming Models

For many years, systems running Nvidia-based GPU architectures have dominated the heterogeneous supercomputer landscape. However, recently GPU chipsets manufactured by Intel and AMD have cut into this market and can now be found in some of the world’s fastest supercomputers. The June 2023 edition of the TOP500 list of supercomputers ranks the Frontier supercomputer at […]
Nov, 5

OpenRAND: A Performance Portable, Reproducible Random Number Generation Library for Parallel Computations

We introduce OpenRAND, a C++17 library aimed at facilitating reproducible scientific research through the generation of statistically robust and yet replicable random numbers. OpenRAND accommodates single and multi-threaded applications on CPUs and GPUs and offers a simplified, user-friendly API that complies with the C++ standard’s random number engine interface. It is portable: it functions seamlessly […]
Oct, 29

Performance portability evaluation of blocked stencil computations on GPUs

In this new era where multiple GPU vendors are leading the supercomputing landscape, and multiple programming models are available to users, the drive to achieve performance portability across platforms faces new challenges. Consider stencil algorithms, where architecture-specific solutions are required to optimize for the parallelism hierarchy and memory hierarchy of emerging systems. In this work, […]
Oct, 29

Dynamic autotuning of SpMV kernel in CUSP library

Sparse matrix-vector product (SpMV) is a central operation in many iterative methods for solving linear systems and as such is an attractive candidate for acceleration on the GPU. However, the performance of the SpMV kernel can vary depending on both the target architecture and the sparsity pattern of the matrix. Thus, to […]
Oct, 29

Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and Analytical Model-driven Tuning Methodologies

GPU-embedded systems have gained popularity across various domains due to their efficient power consumption. However, in order to meet the demands of real-time or time-consuming applications running on these systems, it is crucial for them to be tuned to exhibit high performance. This paper addresses the issue by developing and comparing two tuning methodologies on […]
Oct, 29

A Performance-Portable SYCL Implementation of CRK-HACC for Exascale

The first generation of exascale systems will include a variety of machine architectures, featuring GPUs from multiple vendors. As a result, many developers are interested in adopting portable programming models to avoid maintaining multiple versions of their code. It is necessary to document experiences with such programming models to assist developers in understanding the advantages […]
Oct, 29

GEVO-ML: Optimizing Machine Learning Code with Evolutionary Computation

Parallel accelerators, such as GPUs, are key enablers for large-scale Machine Learning (ML) applications. However, ML model developers often lack detailed knowledge of the underlying system architectures, while system programmers usually do not have a high-level understanding of the ML model that runs on the specific system. To mitigate this gap between two relevant aspects […]
Oct, 22

Performance/power assessment of CNN packages on embedded automotive platforms

The rise of power-efficient embedded computers based on highly parallel accelerators opens a number of opportunities and challenges for researchers and engineers, and has paved the way to the era of edge computing. At the same time, advances in embedded AI for object detection and categorization such as YOLO, GoogLeNet and AlexNet have reached an unprecedented level of […]
Oct, 22

Performance portability analysis of SYCL with a classical CG on CPU, GPU, and FPGA

In this work, the capability of SYCL™ to execute code on different hardware devices is investigated. This motivates conducting a performance portability analysis. The architectures investigated are the CPU, GPU, and FPGA. As a benchmark algorithm, the CG algorithm is used, as it is widely applicable to many fields and is more complex than simple […]
Oct, 22

Predicting the Execution Time of a kernel on a specific GPU using PTX code

During the last couple of decades, there has been an exponential growth in the amount of time and energy required to run workloads on high-performance computing systems, which nowadays rely heavily upon GPUs. In order to reduce the resources required by these systems, one clear approach is to avoid inefficient applications by using prediction models […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
