
Posts

Jul, 16

Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems

Scientific applications strive for increased memory and computing performance, requiring massive amounts of data and time to produce results. Applications utilize large-scale, parallel computing platforms with advanced architectures to accommodate their needs. However, developing performance-portable applications for modern, heterogeneous platforms requires significant effort and expertise in both the application and systems domains. This is […]
Jul, 16

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks

Building large AI fleets to support rapidly growing DL workloads is an active research topic for modern cloud providers. Generating accurate benchmarks plays an essential role in designing the fast-paced software and hardware solutions in this space. Two fundamental challenges to making this scalable are (i) workload representativeness and (ii) the ability to quickly […]
Jul, 16

Tile-based Lightweight Integer Compression in GPU

GPUs are increasingly used for high-performance and interactive data analytics workloads due to their capability to accelerate computation using massive parallelism. A key constraint of GPU-based data analytics today is the limited memory capacity in GPU devices. Data compression is a powerful technique that can mitigate the capacity limitation in two ways: (1) fitting more […]
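The paper's exact codec is not reproduced here, but a minimal sketch of one common lightweight scheme, frame-of-reference encoding over fixed-size tiles, conveys the general idea; the tile size, the fixed 16-bit delta width, and all names below are illustrative assumptions rather than the paper's design:

```cuda
// Minimal sketch of tile-based frame-of-reference (FOR) compression:
// one block compresses one 256-element tile by finding the tile
// minimum and storing each value as a narrow delta from it.
// Tile size and the fixed 16-bit delta width are assumptions.
#include <cstdint>

constexpr int TILE = 256;

__global__ void for_compress(const uint32_t* in, uint32_t* refs,
                             uint16_t* deltas, int n) {
    __shared__ uint32_t smin[TILE];
    int tid = threadIdx.x;
    int idx = blockIdx.x * TILE + tid;

    smin[tid] = (idx < n) ? in[idx] : UINT32_MAX;
    __syncthreads();

    // Tree reduction: after the loop, smin[0] holds the tile minimum.
    for (int s = TILE / 2; s > 0; s >>= 1) {
        if (tid < s && smin[tid + s] < smin[tid]) smin[tid] = smin[tid + s];
        __syncthreads();
    }
    if (tid == 0) refs[blockIdx.x] = smin[0];

    // Assumes each delta fits in 16 bits; a real codec would choose
    // the bit width per tile based on the tile's value range.
    if (idx < n) deltas[idx] = (uint16_t)(in[idx] - smin[0]);
}
// Launch: for_compress<<<(n + TILE - 1) / TILE, TILE>>>(in, refs, deltas, n);
```

Decompression is the inverse, in[i] == refs[i / TILE] + deltas[i], which is why such schemes decompress fast enough to feed GPU query operators.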
Jul, 16

Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU

Many applications, such as autonomous driving and augmented reality, require the concurrent running of multiple deep neural networks (DNNs) that pose different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited and lack […]
Jul, 16

Improving the Performance, Portability, and Productivity of Hardware Accelerators

With the end of Moore's Law and Dennard scaling, attention is shifting to new ways of enhancing computer performance. Improving microprocessor performance is becoming increasingly complex, whereas computational power demands still grow tremendously fast. In recent years, we have been witnessing a paradigm change: rather than using one single chip, the CPU, for computing everything, computers […]
Jul, 9

Safe, Seamless, And Scalable Integration Of Asynchronous GPU Streams In PETSc

Leveraging Graphics Processing Units (GPUs) to accelerate scientific software has proven to be highly successful, but in order to extract more performance, GPU programmers must overcome the high latency costs associated with their use. One method of reducing or hiding this latency cost is to use asynchronous streams to issue commands to the GPU. While […]
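PETSc's stream abstraction is not shown here, but the underlying latency-hiding pattern is plain CUDA: split the work into chunks and enqueue each chunk's transfers and kernel on its own stream so that copies for one chunk overlap compute for another. The chunk count, the scale kernel, and all names below are illustrative, not PETSc code:

```cuda
// Minimal sketch of latency hiding with CUDA streams: each chunk's
// host-to-device copy, kernel, and device-to-host copy go on a
// separate stream, letting the hardware overlap them across chunks.
#include <cuda_runtime.h>

__global__ void scale(double* x, double a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Assumes h is pinned memory (cudaMallocHost); with pageable memory
// the async copies degrade to effectively synchronous ones. Also
// assumes n % nchunks == 0 and nchunks <= 8.
void scale_async(double* h, double* d, int n, int nchunks) {
    cudaStream_t streams[8];
    int chunk = n / nchunks;
    for (int c = 0; c < nchunks; ++c) {
        cudaStreamCreate(&streams[c]);
        int off = c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(double),
                        cudaMemcpyHostToDevice, streams[c]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[c]>>>(d + off, 2.0, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(double),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    for (int c = 0; c < nchunks; ++c) {
        cudaStreamSynchronize(streams[c]);
        cudaStreamDestroy(streams[c]);
    }
}
```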
Jul, 9

Modeling Parallel Programs using Large Language Models

Parallel software codes in high performance computing (HPC) continue to grow in complexity and scale as we enter the exascale era. A diverse set of emerging hardware and programming paradigms make developing, optimizing, and maintaining parallel software burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. […]
Jul, 9

Optimization Techniques for GPU Programming

In the past decade, Graphics Processing Units (GPUs) have played an important role in the field of high-performance computing, and they continue to advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, something that is not trivial. This survey discusses various optimization […]
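As a flavor of the techniques such a survey covers, the contrast below shows memory coalescing, one of the best-known GPU optimizations: consecutive threads should touch consecutive addresses so the hardware can merge a warp's loads into a few wide transactions. The kernels are illustrative examples, not taken from the survey:

```cuda
// Both kernels reduce an n x n row-major matrix A, one output per
// thread; only the access pattern differs.

// Uncoalesced: thread j sums row j, so at each step the threads of a
// warp read addresses n floats apart, splitting every load into many
// memory transactions.
__global__ void row_sum_slow(const float* A, float* out, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    float s = 0.0f;
    for (int k = 0; k < n; ++k) s += A[(size_t)j * n + k];
    out[j] = s;
}

// Coalesced: thread j sums column j, so at each step a warp reads the
// contiguous run A[k*n + j .. k*n + j + 31] in one wide transaction.
__global__ void col_sum_fast(const float* A, float* out, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    float s = 0.0f;
    for (int k = 0; k < n; ++k) s += A[(size_t)k * n + j];
    out[j] = s;
}
```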
Jul, 9

Matrix Multiplication Using Only Addition

Matrix multiplication consumes a large fraction of the time taken in many machine-learning algorithms. Thus, accelerator chips that perform matrix multiplication faster than conventional processors or even GPUs are of increasing interest. In this paper, we demonstrate a method of performing matrix multiplication without a scalar multiplier circuit. In many cases of practical interest, only […]
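The paper's circuit-level technique is not reproduced here; the sketch below only illustrates the underlying principle that a*b equals the sum of shifted copies of a over the set bits of b, so an integer matrix product can be built from adds and shifts alone. All names are illustrative:

```cuda
// Generic illustration of multiplier-free integer multiply via
// shift-and-add, not the paper's specific method.
#include <cstdint>

__device__ uint32_t mul_add_only(uint32_t a, uint32_t b) {
    uint32_t acc = 0;
    while (b) {
        if (b & 1u) acc += a;  // add the current shifted copy of a
        a <<= 1;               // next power-of-two multiple of a
        b >>= 1;
    }
    return acc;  // equals a*b modulo 2^32, like a hardware multiplier
}

__global__ void matmul_add_only(const uint32_t* A, const uint32_t* B,
                                uint32_t* C, int n) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= n || c >= n) return;
    uint32_t s = 0;
    for (int k = 0; k < n; ++k)
        s += mul_add_only(A[r * n + k], B[k * n + c]);
    C[r * n + c] = s;
}
```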
Jul, 9

Improving Automatic Parallel Training via Balanced Memory Workload Optimization

Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual efforts to design […]
Jul, 2

Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation

We evaluate AI-assisted generative capabilities on fundamental numerical kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV, Jacobi Stencil, and CG. We test the generated kernel codes for a variety of language-supported programming models, including (1) C++ (e.g., OpenMP [including offload], OpenACC, Kokkos, SYCL, CUDA, and HIP), (2) Fortran (e.g., OpenMP [including offload] and […]
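For reference, AXPY, the simplest kernel in the evaluated suite, computes y = a*x + y; below is a CUDA version (one of the programming models the study targets) of the kind such tools are prompted to generate:

```cuda
// AXPY: y = a*x + y, one element per thread.
__global__ void axpy(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
// Launch example: axpy<<<(n + 255) / 256, 256>>>(n, a, x, y);
```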
Jul, 2

cuSLINK: Single-linkage Agglomerative Clustering on the GPU

In this paper, we propose cuSLINK, a novel and state-of-the-art reformulation of the SLINK algorithm on the GPU, which requires only O(Nk) space and uses a parameter k to trade off space and time. We also propose a set of novel and reusable building blocks that compose cuSLINK. These building blocks include highly optimized computational […]
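cuSLINK's actual primitives are not reproduced here; to give a sense of the kind of building block single-linkage clustering needs, below is a deliberately naive kernel that finds, for each point, its nearest neighbor in a different cluster, the core query behind each merge step. All names are illustrative, and production code like cuSLINK avoids this O(N^2) scan:

```cuda
#include <cfloat>

// For each point i, find the closest point j with a different cluster
// label (brute force, one thread per point, squared distances).
// Single-linkage repeatedly merges the cluster pair realizing the
// smallest such distance.
__global__ void cross_cluster_1nn(const float* pts, const int* label,
                                  int n, int dim,
                                  float* best_dist, int* best_idx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float best = FLT_MAX;
    int arg = -1;
    for (int j = 0; j < n; ++j) {
        if (label[j] == label[i]) continue;
        float d = 0.0f;
        for (int k = 0; k < dim; ++k) {
            float t = pts[i * dim + k] - pts[j * dim + k];
            d += t * t;
        }
        if (d < best) { best = d; arg = j; }
    }
    best_dist[i] = best;
    best_idx[i] = arg;
}
```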
