
Posts

Jun, 16

Understanding GPU Triggering APIs for MPI+X Communication

GPU-enhanced architectures are now dominant in HPC systems, but message-passing communication involving GPUs with MPI has proven to be both complex and expensive, motivating new approaches that lower such costs. We compare and contrast stream/graph- and kernel-triggered MPI communication abstractions, whose principal purpose is to enhance the performance of communication when GPU kernels create or […]
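
To make the contrast concrete, a minimal sketch (not the paper's code): the classic synchronize-then-send pattern next to a stream-ordered send, using cudaLaunchHostFunc as a portable stand-in for the vendor-specific triggering extensions the paper compares.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Baseline MPI+GPU pattern the triggered APIs aim to improve on:
// the CPU must block on the stream before it may hand data to MPI.
void baseline_send(float* d_buf, int n, int peer, cudaStream_t s) {
    // produce<<<grid, block, 0, s>>>(d_buf, n);  // kernel fills d_buf
    cudaStreamSynchronize(s);                     // CPU stalls here
    MPI_Send(d_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);  // CUDA-aware MPI
}

// Stream-triggered flavor, sketched with a host callback: the send is
// enqueued on the stream, ordered after the kernel, and the CPU never
// blocks. Real triggered APIs avoid even this host round trip; note that
// CUDA forbids CUDA API calls inside the callback, so a CUDA-aware
// MPI_Send here is illustrative only.
struct SendCtx { float* buf; int n; int peer; };

void CUDART_CB do_send(void* p) {
    SendCtx* c = static_cast<SendCtx*>(p);
    MPI_Send(c->buf, c->n, MPI_FLOAT, c->peer, 0, MPI_COMM_WORLD);
}

void triggered_send(float* d_buf, int n, int peer, cudaStream_t s) {
    static SendCtx ctx;
    ctx = {d_buf, n, peer};
    cudaLaunchHostFunc(s, do_send, &ctx);  // fires once the kernel completes
}
```
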
Jun, 16

Stencil Computations on AMD and Nvidia Graphics Processors: Performance and Tuning Strategies

Over the last ten years, graphics processors have become the de facto accelerator for data-parallel tasks in various branches of high-performance computing, including machine learning and computational sciences. However, with the recent introduction of AMD-manufactured graphics processors to the world’s fastest supercomputers, tuning strategies established for previous hardware generations must be re-evaluated. In this study, […]
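
As a toy illustration of the tuning surface (not from the paper), a 3-point 1D stencil in CUDA-style code; the block size and shared-memory staging below are exactly the kinds of knobs whose best settings differ between AMD and NVIDIA hardware.

```cuda
#define BLOCK_X 256  // hypothetical default; re-tune per architecture

// Launch with BLOCK_X threads per block. Each block stages its tile plus a
// one-element halo on each side in shared memory before computing.
__global__ void stencil1d(const float* __restrict__ in,
                          float* __restrict__ out, int n) {
    __shared__ float tile[BLOCK_X + 2];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // index inside the tile

    if (g < n) tile[l] = in[g];
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[BLOCK_X + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();

    if (g > 0 && g < n - 1)
        out[g] = 0.25f * tile[l - 1] + 0.5f * tile[l] + 0.25f * tile[l + 1];
}
```
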
Jun, 16

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work primarily overlooks the diverse key factors […]
Jun, 16

A methodology for comparing optimization algorithms for auto-tuning

Adapting applications to optimally utilize available hardware is no mean feat: the plethora of optimization choices makes manual tuning infeasible. To this end, auto-tuning frameworks are used to automate the task; these in turn use optimization algorithms to efficiently search the vast search spaces. However, there is a lack of comparability in studies […]
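
For intuition, a toy sketch (all names illustrative, not the paper's framework): the measure-and-compare loop below is the skeleton every auto-tuner wraps around its optimization algorithm; replacing the random choice with model-based or evolutionary search is precisely what a fair comparison methodology must account for.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

int main() {
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    const int candidates[] = {32, 64, 128, 256, 512, 1024};
    int best_block = -1;
    float best_ms = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int trial = 0; trial < 20; ++trial) {  // fixed search budget
        int block = candidates[rand() % 6];     // the "optimization algorithm"
        int grid = (n + block - 1) / block;
        cudaEventRecord(start);
        saxpy<<<grid, block>>>(2.0f, x, y, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_block = block; }
    }
    printf("best block size: %d (%.3f ms)\n", best_block, best_ms);
    cudaFree(x); cudaFree(y);
    return 0;
}
```
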
Jun, 16

How much can we gain from Tensor Kernel Fusion on GPUs?

Kernel fusion is a crucial optimization technique for GPU applications, particularly deep neural networks, where it involves combining multiple consecutive kernels into a single larger kernel. This approach aims to enhance performance by reducing the need for slow off-chip memory accesses. Instead, intermediate results between successive kernels are stored in faster on-chip memory like shared […]
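
A minimal, hypothetical example of the idea: two elementwise kernels fused into one, so the intermediate value never makes the round trip through off-chip memory.

```cuda
// Unfused: "tmp" is written to and re-read from global (off-chip) memory.
__global__ void scale(const float* x, float* tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = 2.0f * x[i];
}
__global__ void add(const float* tmp, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + y[i];
}

// Fused: the intermediate lives in a register, eliminating one global
// write and one global read per element.
__global__ void scale_add_fused(const float* x, const float* y,
                                float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = 2.0f * x[i];  // stays on-chip
        out[i] = t + y[i];
    }
}
```
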
Jun, 9

Memory Interference and Performance Prediction in GPU-Accelerated Heterogeneous Systems

Nowadays, a variety of applications, including automated factories, autonomous vehicles, and Cyber-Physical Systems (CPS), are experiencing significant growth. Given the diverse range of challenges that must be addressed, such as real-time management and visualization of a factory’s current state through a 3D digital twin, trajectory calculation within autonomous vehicles, visualizing Human Machine Interfaces (HMI), […]
Jun, 9

Gaining Cross-Platform Parallelism for HAL’s Molecular Dynamics Package using SYCL

Molecular dynamics simulations are one of the methods in scientific computing that benefit from GPU acceleration. For those devices, SYCL is a promising API for writing portable codes. In this paper, we present the case study of "HAL’s MD package" that has been successfully migrated from CUDA to SYCL. We describe the different strategies that […]
Jun, 9

More Bang For Your Buck(et): Fast and Space-efficient Hardware-accelerated Coarse-granular Indexing on GPUs

In recent work, we have shown that NVIDIA’s ray-tracing cores on RTX-class GPUs can be exploited to realize hardware-accelerated lookups for GPU-resident database indexes. At a high level, the concept materializes all keys as triangles in a 3D scene and indexes them. Lookups are performed by firing rays into the scene and utilizing the […]
Jun, 9

Fast and Practical Strassen’s Matrix Multiplication using FPGAs

Matrix multiplication is a cornerstone operation in a wide array of scientific fields, including machine learning and computer graphics. The standard algorithm for matrix multiplication has a complexity of O(n^3) for n×n matrices. Strassen’s algorithm improves this to O(n^2.807), but its practicality is limited for small to medium matrix sizes due to the large number […]
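
For reference, the exponent follows from Strassen's seven half-size products (versus eight in the standard blocked algorithm) via the master theorem:

```latex
T(n) = 7\,T\!\left(\tfrac{n}{2}\right) + \Theta\!\left(n^{2}\right)
\;\Longrightarrow\;
T(n) = \Theta\!\left(n^{\log_{2} 7}\right) \approx \Theta\!\left(n^{2.807}\right)
```
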
Jun, 9

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem for a directed, weighted graph, whose nodes represent GPU instances and edges capture both […]
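
In standard max-flow terms (our gloss of the abstract, not necessarily the paper's exact formulation), the objective is the usual linear program, where the capacities c_e encode both GPU compute throughput and inter-GPU link bandwidth:

```latex
\max_{f} \sum_{e \in \delta^{+}(s)} f_e
\quad \text{s.t.} \quad
0 \le f_e \le c_e \;\;\forall e \in E,
\qquad
\sum_{e \in \delta^{-}(v)} f_e = \sum_{e \in \delta^{+}(v)} f_e
\;\;\forall v \in V \setminus \{s, t\}
```
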
Jun, 2

Addressing Challenges in Utilizing GPUs for Accelerating Privacy-Preserving Computation

Cloud computing increasingly handles confidential data, such as private inference and private database queries. Two strategies are used for secure computation: (1) employing CPU Trusted Execution Environments (TEEs) like AMD SEV, Intel SGX, or ARM TrustZone, and (2) utilizing emerging cryptographic methods like Fully Homomorphic Encryption (FHE), with libraries such as HElib, Microsoft SEAL, and PALISADE. To […]
Jun, 2

Machine learning enhanced code optimization for high-level synthesis (ML-ECOHS)

While Field-Programmable Gate Arrays (FPGAs) exist in many design configurations throughout the data center, cloud, and edge, the promise of performance and flexibility offered by the FPGA often remains unrealized for lack of hardware-design expertise, with most computation remaining in fixed hardware such as CPUs, GPUs, and ASICs (e.g., tensor processors). Identifying programmability as […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org