
Posts

Dec, 8

Unified schemes for directive-based GPU offloading

The GPU is the dominant accelerator device due to its high performance and energy efficiency. Directive-based GPU offloading using OpenACC or OpenMP target is a convenient way to port existing codes originally developed for multicore CPUs. Although OpenACC and OpenMP target provide similar features, both methods have pros and cons. OpenACC has better functions and an […]
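To illustrate the similarity the abstract refers to, here is a hedged sketch (not taken from the paper) of the same SAXPY loop offloaded once with OpenACC and once with OpenMP target directives; compilers without offload support simply ignore the pragmas and run the loops on the host:

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

// SAXPY with an OpenACC directive: copyin/copy clauses move x to the
// device and y both ways; the loop body itself is unchanged host code.
void saxpy_acc(float a, const std::vector<float>& x, std::vector<float>& y) {
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = y.size();
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}

// The same loop with OpenMP target directives: the map clauses play
// the role of OpenACC's data clauses.
void saxpy_omp(float a, const std::vector<float>& x, std::vector<float>& y) {
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = y.size();
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}
```

The one-directive-per-loop structure is what makes unifying the two models attractive: only the pragma spelling differs, not the loop.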
Dec, 8

FortranX: Harnessing Code Generation, Portability, and Heterogeneity in Fortran

Due to its historical popularity, Fortran was used to implement many important scientific applications. The complexity of these applications, along with the transition to modern high-performance languages like C++, has made their modernization and optimization challenging. Significant development time is incurred to understand and optimize key algorithms as well as to leverage new […]
Dec, 8

Guardian: Safe GPU Sharing in Multi-Tenant Environments

Modern GPU applications, such as machine learning (ML), can only partially utilize GPUs, leading to GPU underutilization in cloud environments. Sharing GPUs across multiple applications from different tenants can improve resource utilization and consequently cost, energy, and power efficiency. However, GPU sharing creates memory safety concerns because kernels must share a single GPU address space. […]
Dec, 1

PyOMP: Parallel programming for CPUs and GPUs with OpenMP and Python

Python is the most popular programming language. OpenMP is the most popular parallel programming API. Projecting OpenMP into Python will help expand the HPC community. We call our Python-based OpenMP system PyOMP. In this short paper we describe PyOMP and its use for parallel programming for CPUs and GPUs. We describe its implementation through the […]
Dec, 1

CLUEstering: a high-performance density-based clustering library for scientific computing

Clustering is a computational technique that aims to classify objects based on their similarity, and is widely used in many branches of science, for instance in image segmentation, medical imaging, the study of complex systems, machine learning, and high-energy physics. As the amount of data collected in every field of research increases, techniques like […]
Dec, 1

Hardware Accelerators for Artificial Intelligence

In this chapter, we provide an in-depth exploration of the specialized hardware accelerators designed to enhance Artificial Intelligence (AI) applications, focusing on their necessity, development, and impact on the field of AI. It covers the transition from traditional computing systems to advanced AI-specific hardware, addressing the growing demands of AI algorithms and the […]
Dec, 1

Scaling SU(2) to 1000 GPUs using HiRep

HiRep allows flexible simulations of higher representations of Wilson Fermions with various actions and gauge groups and a range of inverters and integrators. This is particularly important for enabling evaluations of observables relevant to phenomenological inputs for Beyond-the-Standard-Model physics from lattice field theory. We present progress on the GPU porting of available features, especially in […]
Dec, 1

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

We present an analytical framework for predicting General Matrix Multiplication (GEMM) performance on modern GPUs, focusing on runtime, power consumption, and energy efficiency. Our study employs two approaches: a custom-implemented tiled matrix multiplication kernel for fundamental analysis, and NVIDIA’s CUTLASS library for comprehensive performance data collection across advanced configurations. Using the NVIDIA RTX 4070 as our experimental platform, […]
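The paper's custom kernel is a CUDA tiled GEMM; as a hedged host-side sketch of the same blocking structure (the tile size here is an assumed, tunable value, not the paper's configuration), the three outer loops below walk TILE-by-TILE blocks the way a GPU kernel stages tiles in shared memory:

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>
#include <cassert>

constexpr std::size_t TILE = 16;  // assumed tile size, not from the paper

// Tiled GEMM sketch: C (n x n) = A (n x n) * B (n x n), row-major.
// Blocking over (ii, kk, jj) improves cache reuse on the host and
// mirrors the shared-memory staging pattern of a GPU kernel.
std::vector<double> gemm_tiled(const std::vector<double>& A,
                               const std::vector<double>& B,
                               std::size_t n) {
    std::vector<double> C(n * n, 0.0);
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t kk = 0; kk < n; kk += TILE)
            for (std::size_t jj = 0; jj < n; jj += TILE)
                // Multiply one block pair; bounds are clamped so n
                // need not be a multiple of TILE.
                for (std::size_t i = ii; i < std::min(ii + TILE, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + TILE, n); ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + TILE, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
    return C;
}
```

Runtime models of the kind the abstract describes typically start from exactly these loop bounds: the tile size fixes the arithmetic intensity per block, which feeds the performance and energy predictions.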
Nov, 24

A Distributed-memory Tridiagonal Solver Based on a Specialised Data Structure Optimised for CPU and GPU Architectures

Various numerical methods used for solving partial differential equations (PDE) result in tridiagonal systems. Solving tridiagonal systems in distributed-memory environments is not straightforward, and often requires a significant amount of communication. In this article, we present a novel distributed-memory tridiagonal solver algorithm, DistD2-TDS, based on a specialised data structure. The DistD2-TDS algorithm takes advantage of the diagonal […]
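The single-node building block behind any distributed tridiagonal solver is the classic O(n) Thomas algorithm; the sketch below shows that serial baseline (the paper's DistD2-TDS data structure and communication scheme are not reproduced here):

```cpp
#include <vector>
#include <cstddef>
#include <cassert>
#include <cmath>

// Thomas algorithm for a tridiagonal system Ax = d, where a is the
// sub-diagonal (a[0] unused), b the diagonal, and c the super-diagonal
// (c[n-1] unused). Forward elimination followed by back substitution.
std::vector<double> thomas_solve(std::vector<double> a, std::vector<double> b,
                                 std::vector<double> c, std::vector<double> d) {
    const std::size_t n = b.size();
    for (std::size_t i = 1; i < n; ++i) {
        const double w = a[i] / b[i - 1];  // eliminate the sub-diagonal entry
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )  // back substitution
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```

The data dependency running through both sweeps is what makes distributed and GPU versions hard: each row needs the previous row's result, so naive domain decomposition forces the communication the abstract mentions.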
Nov, 24

SoK: A Systems Perspective on Compound AI Threats and Countermeasures

Large language models (LLMs) deployed across enterprises are often proprietary and operate on sensitive inputs and data. The wide range of attack vectors identified in prior research – targeting various software and hardware components used in training and inference – makes it extremely challenging to enforce confidentiality and integrity policies. As we advance towards […]
Nov, 24

gpuPairHMM: High-speed Pair-HMM Forward Algorithm for DNA Variant Calling on GPUs

The continually increasing volume of DNA sequence data has resulted in a growing demand for fast implementations of core algorithms. Computation of pairwise alignments between candidate haplotypes and sequencing reads using Pair-HMMs is a key component in DNA variant calling tools such as the GATK HaplotypeCaller, but can be highly time-consuming due to its […]
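The Pair-HMM used in variant callers runs a forward recurrence over a 2-D read-by-haplotype grid with match/insert/delete states; as a hedged, simplified illustration of the underlying dynamic-programming pattern (a generic 1-D HMM, not the GATK model), the forward algorithm looks like this:

```cpp
#include <vector>
#include <cstddef>
#include <cassert>
#include <cmath>

// Minimal HMM forward algorithm: returns P(observations) for an HMM
// with S hidden states. pi = initial probabilities, trans[i][j] =
// P(state j | state i), emit[i][o] = P(symbol o | state i).
double hmm_forward(const std::vector<double>& pi,
                   const std::vector<std::vector<double>>& trans,
                   const std::vector<std::vector<double>>& emit,
                   const std::vector<int>& obs) {
    const std::size_t S = pi.size();
    std::vector<double> alpha(S);
    for (std::size_t s = 0; s < S; ++s)            // initialization
        alpha[s] = pi[s] * emit[s][obs[0]];
    for (std::size_t t = 1; t < obs.size(); ++t) { // recurrence over time
        std::vector<double> next(S, 0.0);
        for (std::size_t j = 0; j < S; ++j) {
            for (std::size_t i = 0; i < S; ++i)
                next[j] += alpha[i] * trans[i][j];
            next[j] *= emit[j][obs[t]];
        }
        alpha = next;
    }
    double p = 0.0;                                // termination: sum alphas
    for (double v : alpha) p += v;
    return p;
}
```

Each alpha value depends only on the previous time step's alphas, which is exactly the anti-diagonal dependency that GPU Pair-HMM implementations exploit to process many cell updates in parallel.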
Nov, 24

Performance portability via C++ PSTL, SYCL, OpenMP, and HIP: the Gaia AVU-GSR case study

Applications that analyze data from modern scientific experiments will soon require a computing capacity of ExaFLOPs. The current trend to achieve such performance is to employ GPU-accelerated supercomputers and design applications to exploit this hardware optimally. Since each supercomputer is typically a one-off project, the necessity of having computational languages portable across diverse CPU and […]

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
