
Posts

Dec, 1

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

We present an analytical framework for predicting General Matrix Multiplication (GEMM) performance on modern GPUs, focusing on runtime, power consumption, and energy efficiency. Our study employs two approaches: a custom-implemented tiled matrix multiplication kernel for fundamental analysis, and NVIDIA’s CUTLASS library for comprehensive performance data collection across advanced configurations. Using the NVIDIA RTX 4070 as our experimental platform, […]
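
The excerpt mentions a custom tiled matrix multiplication kernel but does not show it. As a reference point, a minimal shared-memory tiled GEMM in CUDA, the standard textbook formulation rather than the authors' kernel, looks roughly like this (square N x N row-major matrices and 16x16 tiles assumed):

    #define TILE 16

    // C = A * B for N x N row-major matrices, one output element per thread.
    // Launch with dim3 block(TILE, TILE) and a grid of
    // ((N + TILE - 1) / TILE) x ((N + TILE - 1) / TILE) blocks.
    __global__ void tiled_gemm(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        // Walk the K dimension one tile at a time, staging both operands
        // in shared memory to cut global-memory traffic.
        for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
            int aCol = t * TILE + threadIdx.x;
            int bRow = t * TILE + threadIdx.y;
            As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < N && col < N)
            C[row * N + col] = acc;
    }
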
Dec, 1

Hardware Accelerators for Artificial Intelligence

In this chapter, we aim to provide an in-depth exploration of the specialized hardware accelerators designed to enhance Artificial Intelligence (AI) applications, focusing on their necessity, development, and impact on the field of AI. The chapter covers the transition from traditional computing systems to advanced AI-specific hardware, addressing the growing demands of AI algorithms and the […]
Nov, 24

A Distributed-memory Tridiagonal Solver Based on a Specialised Data Structure Optimised for CPU and GPU Architectures

Various numerical methods used for solving partial differential equations (PDEs) result in tridiagonal systems. Solving tridiagonal systems in distributed-memory environments is not straightforward and often requires a significant amount of communication. In this article, we present a novel distributed-memory tridiagonal solver algorithm, DistD2-TDS, based on a specialised data structure. The DistD2-TDS algorithm takes advantage of the diagonal […]
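
As background, the serial building block that such solvers compare against is the Thomas algorithm: O(n) forward elimination followed by back substitution over the three diagonals. A minimal C++ sketch of that baseline (not the DistD2-TDS algorithm, whose data structure is not shown in the excerpt):

    #include <vector>

    // Solve a tridiagonal system with sub-diagonal a, diagonal b,
    // super-diagonal c and right-hand side d (all length n; a[0] and
    // c[n-1] are unused). Overwrites b and d; d holds the solution.
    void thomas_solve(std::vector<double>& a, std::vector<double>& b,
                      std::vector<double>& c, std::vector<double>& d) {
        const int n = (int)d.size();
        // Forward elimination: eliminate the sub-diagonal.
        for (int i = 1; i < n; ++i) {
            double w = a[i] / b[i - 1];
            b[i] -= w * c[i - 1];
            d[i] -= w * d[i - 1];
        }
        // Back substitution.
        d[n - 1] /= b[n - 1];
        for (int i = n - 2; i >= 0; --i)
            d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
    }

The algorithm is inherently sequential along the diagonal, which is exactly why distributed-memory variants need careful data layout and communication.
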
Nov, 24

SoK: A Systems Perspective on Compound AI Threats and Countermeasures

Large language model (LLM) deployments across enterprises often rely on proprietary models and operate on sensitive inputs and data. The wide range of attack vectors identified in prior research – targeting various software and hardware components used in training and inference – makes it extremely challenging to enforce confidentiality and integrity policies. As we advance towards […]
Nov, 24

gpuPairHMM: High-speed Pair-HMM Forward Algorithm for DNA Variant Calling on GPUs

The continually increasing volume of DNA sequence data has resulted in a growing demand for fast implementations of core algorithms. Computation of pairwise alignments between candidate haplotypes and sequencing reads using Pair-HMMs is a key component of DNA variant calling tools such as the GATK HaplotypeCaller, but it can be highly time-consuming due to its […]
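
Pair-HMMs generalize the classic single-sequence HMM forward algorithm to a pair of sequences (read and haplotype). As a reference point, the classic forward recursion, of which the Pair-HMM version is a two-dimensional analogue, can be sketched as below; hmm_forward and its parameter layout are illustrative, not taken from gpuPairHMM:

    #include <vector>

    // Classic HMM forward algorithm over S states and an observation
    // sequence obs of length T. Returns P(obs | model). Pair-HMMs in
    // variant calling run the same dynamic-programming idea over a
    // 2-D grid indexed by read and haplotype positions.
    double hmm_forward(const std::vector<std::vector<double>>& trans, // S x S
                       const std::vector<std::vector<double>>& emit,  // S x symbols
                       const std::vector<double>& init,               // length S
                       const std::vector<int>& obs) {
        const int S = (int)init.size(), T = (int)obs.size();
        std::vector<double> alpha(S), next(S);
        for (int s = 0; s < S; ++s)
            alpha[s] = init[s] * emit[s][obs[0]];
        for (int t = 1; t < T; ++t) {
            for (int j = 0; j < S; ++j) {
                double sum = 0.0;
                for (int i = 0; i < S; ++i)
                    sum += alpha[i] * trans[i][j];
                next[j] = sum * emit[j][obs[t]];
            }
            alpha.swap(next);
        }
        double p = 0.0;
        for (int s = 0; s < S; ++s) p += alpha[s];
        return p;
    }
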
Nov, 24

Performance portability via C++ PSTL, SYCL, OpenMP, and HIP: the Gaia AVU-GSR case study

Applications that analyze data from modern scientific experiments will soon require computing capacity at the ExaFLOP scale. The current trend for achieving such performance is to employ GPU-accelerated supercomputers and to design applications that exploit this hardware optimally. Since each supercomputer is typically a one-off project, the necessity of having computational languages portable across diverse CPU and […]
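
Of the models named in the title, C++ PSTL is the most compact to illustrate: parallelism is expressed through standard algorithms and execution policies, which compilers such as nvc++ (with -stdpar) can offload to GPUs. A generic axpy sketch, unrelated to the Gaia AVU-GSR code itself:

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 20;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        const float a = 0.5f;
        // y = a*x + y, parallelized (and potentially GPU-offloaded,
        // e.g. by nvc++ -stdpar) using only the standard library.
        std::transform(std::execution::par_unseq,
                       x.begin(), x.end(), y.begin(), y.begin(),
                       [a](float xi, float yi) { return a * xi + yi; });
        return 0;
    }
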
Nov, 24

Edify 3D: Scalable High-Quality 3D Asset Generation

We introduce Edify 3D, an advanced solution designed for high-quality 3D asset generation. Our method first synthesizes RGB and surface normal images of the described object at multiple viewpoints using a diffusion model. The multi-view observations are then used to reconstruct the shape, texture, and PBR materials of the object. Our method can generate high-quality […]
Nov, 17

GPUVM: GPU-driven Unified Virtual Memory

Graphics Processing Units (GPUs) leverage massive parallelism and large memory bandwidth to support high-performance computing applications, such as multimedia rendering, crypto-mining, deep learning, and natural language processing. These applications require models and datasets that are growing in size and now challenge the memory capacity of a single GPU, causing substantial performance overheads. To address […]
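
For context, CUDA's existing Unified Virtual Memory is exposed through managed allocations whose pages migrate between host and device on demand; the paper's GPU-driven variant is not shown in the excerpt. A minimal example of the baseline mechanism:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float* data, int n, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    int main() {
        const int n = 1 << 20;
        float* data = nullptr;
        // One allocation visible to both CPU and GPU; pages migrate on
        // demand, which lets working sets exceed a single GPU's memory.
        cudaMallocManaged(&data, n * sizeof(float));
        for (int i = 0; i < n; ++i) data[i] = 1.0f;     // touched on the host
        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // migrated to the device
        cudaDeviceSynchronize();
        printf("data[0] = %f\n", data[0]);              // migrated back on access
        cudaFree(data);
        return 0;
    }
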
Nov, 17

Context Parallelism for Scalable Million-Token Inference

We present context parallelism for long-context large language model inference, which achieves near-linear scaling of long-context prefill latency with up to 128 H100 GPUs across 16 nodes. In particular, our method achieves 1M-context prefill with the Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K-context prefill in 3.8s. We develop two […]
Nov, 17

Kokkidio: Fast, expressive, portable code, based on Kokkos and Eigen

Kokkidio is a newly developed C++ template library that combines the performance portability framework Kokkos, and its strength in utilising GPUs, with the expressive syntax and CPU optimisations of the linear algebra library Eigen. Its unified abstractions enable both simple data management and clear, succinct compute code in kernel functors, where a novel […]
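
The excerpt does not show Kokkidio's API; for a sense of the kernel-functor style it builds on, here is a plain Kokkos sketch (ordinary Kokkos views and a parallel_for, without Kokkidio's Eigen-backed expressions):

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
        Kokkos::initialize(argc, argv);
        {
            const int n = 1 << 20;
            Kokkos::View<float*> x("x", n), y("y", n);
            Kokkos::deep_copy(x, 1.0f);
            Kokkos::deep_copy(y, 2.0f);
            const float a = 0.5f;
            // One kernel functor, portable across CPU and GPU backends.
            Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
                y(i) += a * x(i);
            });
            Kokkos::fence();
        }
        Kokkos::finalize();
        return 0;
    }
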
Nov, 17

Improving Parallel Program Performance Through DSL-Driven Code Generation with LLM Optimizers

Mapping computations to processors and assigning data to memory are critical for maximizing performance in parallel programming. These mapping decisions are managed through the development of specialized low-level system code, called mappers, crafted by performance engineers. Each mapper is tailored to a specific application and optimized for the underlying machine architecture, a process that requires […]
Nov, 17

The Rewriting of DataRaceBench Benchmark for OpenCL Program Validations

Effective detection of data races in parallel computing environments is essential for ensuring the correctness and performance of multi-threaded applications. This paper addresses data race analysis for OpenCL programs. For data race research, there is a well-established benchmark, DataRaceBench, designed for OpenMP. In our research, we rewrite the OpenMP DataRaceBench benchmark for […]
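
The defect class DataRaceBench encodes is easy to reproduce on a GPU. The sketch below shows an unsynchronized read-modify-write and its atomic fix, written in CUDA for consistency with the other sketches here (the benchmark itself targets OpenCL, and originally OpenMP):

    __global__ void racy_sum(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            *out += in[i];         // DATA RACE: unsynchronized read-modify-write
    }

    __global__ void safe_sum(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(out, in[i]); // fixed: atomic accumulation
    }
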

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org