29715

Posts

Feb, 3

On the Partitioning of GPU Power among Multi-Instances

Efficient power management in cloud data centers is essential for reducing costs, enhancing performance, and minimizing environmental impact. GPUs, critical for tasks like machine learning (ML) and GenAI, are major contributors to power consumption. NVIDIA’s Multi-Instance GPU (MIG) technology improves GPU utilization by enabling isolated partitions with per-partition resource tracking, facilitating GPU sharing by multiple […]
Feb, 3

Profiling Apple Silicon Performance for ML Training

Apple Silicon has attracted much attention for its performance and role in machine learning (ML) training. Unlike NVIDIA GPUs, which have traditionally dominated ML training, Apple Silicon has a significant difference in memory architecture. It uses Unified Memory, which integrates CPU and GPU memory instead of separate CPU memory and GPU VRAM. However, it is […]
Jan, 27

Column-Oriented Datalog on the GPU

Datalog is a logic programming language widely used in knowledge representation and reasoning (KRR), program analysis, and social media mining due to its expressiveness and high performance. Traditionally, Datalog engines use either row-oriented or column-oriented storage. Engines like VLog and Nemo favor column-oriented storage for efficiency on limited-resource machines, while row-oriented engines like Souffle use […]
Jan, 27

Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure?

To match the blooming demand of generative AI workloads, GPU designers have so far been trying to pack more and more compute and memory into single complex and expensive packages. However, there is growing uncertainty about the scalability of individual GPUs and thus AI clusters, as state-of-the-art GPUs are already displaying packaging, yield, and cooling […]
Jan, 27

Exploring data flow design and vectorization with oneAPI for streaming applications on CPU+GPU

In recent times, oneAPI has emerged as a competitive framework to optimize streaming applications on heterogeneous CPU+GPU architectures, since it provides portability and performance thanks to the SYCL programming language and efficient parallel libraries as oneTBB. However, this approach opens up a wealth of implementations alternatives in this type of applications: from how to design […]
Jan, 27

Adaptive Optimization Techniques for High-Performance Computing

The dataset sizes and computing needs of increasingly prevalent high-performance computing (HPC) applications have grown exponentially over the last decade. Moreover, modern computing architectures are evolving with different paradigms, and accelerators have become indispensable parts of computing. Consequently, the imperative for performance optimization for HPC applications and intelligent resource management for evolving architectures has become […]
Jan, 27

Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis

Modern GPUs, with their specialized hardware like tensor cores, are essential for demanding AI and deep learning applications. This study presents a comprehensive, multi-level microbenchmarking analysis of the NVIDIA Hopper GPU architecture, delving into its performance characteristics and novel features. We benchmark Hopper’s memory subsystem latency and throughput, comparing its L2 partitioned cache behavior and […]
Jan, 20

Code Generation for Cryptographic Kernels using Multi-word Modular Arithmetic on GPU

Fully homomorphic encryption (FHE) and zero-knowledge proofs (ZKPs) are emerging as solutions for data security in distributed environments. However, the widespread adoption of these encryption techniques is hindered by their significant computational overhead, primarily resulting from core cryptographic operations that involve large integer arithmetic. This paper presents a formalization of multi-word modular arithmetic (MoMA), which […]
Jan, 20

GSParLib: A multi-level programming interface unifying OpenCL and CUDA for expressing stream and data parallelism

The evolution of Graphics Processing Units (GPUs) has allowed the industry to overcome long-lasting problems and challenges. Many belong to the stream processing domain, whose central aspect is continuously receiving and processing data from streaming data producers such as cameras and sensors. Nonetheless, programming GPUs is challenging because it requires deep knowledge of many-core programming, […]
Jan, 20

Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

Graphics Processing Units (GPUs) have become the standard in accelerating scientific applications on heterogeneous systems. However, as GPUs are getting faster, one potential performance bottleneck with GPU-accelerated applications is the overhead from launching several fine-grained kernels. CUDA Graph addresses these performance challenges by enabling a graph-based execution model that captures operations as nodes and dependence […]
Jan, 20

Keras Sig: Efficient Path Signature Computation on GPU in Keras 3

In this paper we introduce Keras Sig a high-performance pythonic library designed to compute path signature for deep learning applications. Entirely built in Keras 3, Keras Sig leverages the seamless integration with the mostly used deep learning backends such as PyTorch, JAX and TensorFlow. Inspired by Kidger and Lyons (2021),we proposed a novel approach reshaping […]
Jan, 20

A User’s Guide to KSig: GPU-Accelerated Computation of the Signature Kernel

The signature kernel is a positive definite kernel for sequential and temporal data that has become increasingly popular in machine learning applications due to powerful theoretical guarantees, strong empirical performance, and recently introduced various scalable variations. In this chapter, we give a short introduction to KSig, a Scikit-Learn compatible Python package that implements various GPU-accelerated […]

* * *

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hpgu.org