high performance computing on graphics processing units: hgpu.org

Posts

Dec, 24

Accelerating Sparse Graph Neural Networks with Tensor Core Optimization

Graph neural networks (GNNs) have seen extensive application in domains such as social networks, bioinformatics, and recommendation systems. However, the irregularity and sparsity of graph data challenge traditional computing methods, which are insufficient to meet the performance demands of GNNs. Recent research has explored parallel acceleration using CUDA Cores and Tensor Cores, but significant challenges […]

CUDA

Dec, 24

HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages

Large Language Model (LLM) based coding tools have been tremendously successful as software development assistants, yet they are often designed for general purpose programming tasks and perform poorly for more specialized domains such as high performance computing. Creating specialized models and tools for these domains is crucial towards gaining the benefits of LLMs in areas […]

CUDA

Dec, 24

Reproducible Study and Performance Analysis of GPU Programming Paradigms: OpenACC vs. CUDA in Key Linear Algebra Computations

Scientific and engineering problems are frequently governed by partial differential equations; however, the analytical solutions of these equations are often impractical, thereby forcing the adoption of numerical methods. Basic Linear Algebra Subprograms (BLAS) operations constitute a fundamental component of these numerical approaches, incorporating essential tasks such as Level 1 operations (dot products and vector addition), […]

CUDA

Dec, 15

Deep Learning Model Security: Threats and Defenses

Deep learning has transformed AI applications but faces critical security challenges, including adversarial attacks, data poisoning, model theft, and privacy leakage. This survey examines these vulnerabilities, detailing their mechanisms and impact on model integrity and confidentiality. Practical implementations, including adversarial examples, label flipping, and backdoor attacks, are explored alongside defenses such as adversarial training, differential […]

Dec, 15

Analyzing the Performance Portability of SYCL across CPUs, GPUs, and Hybrid Systems with Protein Database Search

The high-performance computing (HPC) landscape is undergoing rapid transformation, with an increasing emphasis on energy-efficient and heterogeneous computing environments. This comprehensive study extends our previous research on SYCL’s performance portability by evaluating its effectiveness across a broader spectrum of computing architectures, including CPUs, GPUs, and hybrid CPU-GPU configurations from NVIDIA, Intel, and AMD. Our analysis […]

CUDA

Dec, 15

Towards Performance-Aware Allocation for Accelerated Machine Learning on GPU-SSD Systems

The exponential growth of data-intensive machine learning workloads has exposed significant limitations in conventional GPU-accelerated systems, especially when processing datasets exceeding GPU DRAM capacity. We propose MQMS, an augmented in-storage GPU architecture and simulator that is aware of internal SSD states and operations, enabling intelligent scheduling and address allocation to overcome performance bottlenecks caused by […]

Dec, 15

RTCUDB: Building Databases with RT Processors

A spectrum of new hardware has been studied to accelerate database systems in the past decade. Specifically, CUDA cores are known to benefit from the fast development of GPUs and make notable performance improvements. The state-of-the-art GPU-based implementation, i.e., Crystal, can achieve up to 61 times higher performance than CPU-based implementations. However, experiments show that […]

CUDA

Dec, 15

Leveraging the potential of task-based programming with OpenMP task graphs

The task execution model is widely used in computer engineering, it helps developers to design, develop and understand software systems. OpenMP is the de-facto programming model to parallelize sequential algorithms on shared-memory machines. Coupled with the task parallelization, OpenMP is able to conveniently parallelize structured and non-structured applications, it also allows users to offload work […]

CUDA

Dec, 8

LLOR: Automated Repair of OpenMP Programs

In this paper, we present a technique for repairing data race errors in parallel programs written in C/C++ and Fortran using the OpenMP API. Our technique can also remove barriers that are deemed unnecessary for correctness. We implement these ideas in our tool called LLOR, which takes a language-independent approach to provide appropriate placements of […]

Dec, 8

Automating Energy-Efficient GPU Kernel Generation: A Fast Search-Based Compilation Approach

Deep Neural Networks (DNNs) have revolutionized various fields, but their deployment on GPUs often leads to significant energy consumption. Unlike existing methods for reducing GPU energy consumption, which are either hardware-inflexible or limited by workload constraints, this paper addresses the problem at the GPU kernel level. We propose a novel search-based compilation method to generate […]

CUDA

Dec, 8

Unified schemes for directive-based GPU offloading

GPU is the dominant accelerator device due to its high performance and energy efficiency. Directive-based GPU offloading using OpenACC or OpenMP target is a convenient way to port existing codes originally developed for multicore CPUs. Although OpenACC and OpenMP target provide similar features, both methods have pros and cons. OpenACC has better functions and an […]

Dec, 8

FortranX: Harnessing Code Generation, Portability, and Heterogeneity in Fortran

Due to its historical popularity, Fortran was used to implement many important scientific applications. The complexity of these applications along with the transition to modern high performance languages like C++ has made modernization and optimization challenging for these applications. Significant development time is incurred to understand and optimize key algorithms as well as leverage new […]

CUDA

•

OpenCL

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

high performance computing on graphics processing units: hgpu.org

Posts

Accelerating Sparse Graph Neural Networks with Tensor Core Optimization

HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages

Reproducible Study and Performance Analysis of GPU Programming Paradigms: OpenACC vs. CUDA in Key Linear Algebra Computations

Deep Learning Model Security: Threats and Defenses

Analyzing the Performance Portability of SYCL across CPUs, GPUs, and Hybrid Systems with Protein Database Search

Towards Performance-Aware Allocation for Accelerated Machine Learning on GPU-SSD Systems

RTCUDB: Building Databases with RT Processors

Leveraging the potential of task-based programming with OpenMP task graphs

LLOR: Automated Repair of OpenMP Programs

Automating Energy-Efficient GPU Kernel Generation: A Fast Search-Based Compilation Approach

Unified schemes for directive-based GPU offloading

FortranX: Harnessing Code Generation, Portability, and Heterogeneity in Fortran

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Most viewed papers (last 30 days)