
Posts

Mar, 3

Parallel programming in mobile devices with FancyJCL

Mobile devices and handheld systems, such as the now-ubiquitous smartphones and tablets, are becoming increasingly powerful. Their basic hardware configuration is usually a state-of-the-art heterogeneous architecture consisting of multi-core processors and some kind of accelerator, such as a GPU or DSP. Code specifically adapted to the architecture is mandatory if high-performance computation is required, and low-level […]
Mar, 3

Low-Overhead Trace Collection and Profiling on GPU Compute Kernels

While GPUs can bring substantial speedup to compute-intensive tasks, their programming is notoriously hard. From their programming model to microarchitectural particularities, the programmer may encounter many pitfalls that hinder performance in obscure ways. Numerous performance analysis tools provide helpful data on the efficiency of the compute kernels, but few allow the programmer to efficiently […]
Feb, 25

APPy: Annotated Parallelism for Python on GPUs

GPUs are increasingly being used to speed up Python applications in the scientific computing and machine learning domains. Currently, the two common approaches to leveraging GPU acceleration in Python are 1) creating a custom native GPU kernel and importing it as a function that can be called from Python; 2) using libraries such as […]
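
As a concrete illustration of approach 1, the native side can be a single exported function that Python loads via ctypes. The sketch below is ours, not APPy's API; the function name, file names, and build line are hypothetical.

    // saxpy.cpp -- a native "kernel" with C linkage so Python can load it.
    // Build, for example: g++ -O2 -shared -fPIC saxpy.cpp -o libsaxpy.so
    #include <cstddef>

    extern "C" void saxpy(float a, const float* x, float* y, std::size_t n) {
        // y <- a*x + y; a real GPU build would launch this body as a device
        // kernel instead of running a host loop.
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

From Python, ctypes.CDLL("./libsaxpy.so").saxpy(...) then calls the function directly; approach 2 trades this build-and-import boilerplate for library calls.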
Feb, 25

Analyzing GPU Performance in Virtualized Environments: A Case Study

The graphics processing unit (GPU) plays a crucial role in boosting application performance and enhancing computational tasks. Thanks to its parallel architecture and energy efficiency, the GPU has become essential in many computing scenarios. Moreover, the advent of GPU virtualization has been a significant breakthrough, as it provides scalable and adaptable GPU […]
Feb, 25

Assessing opportunities of SYCL for biological sequence alignment on GPU-based systems

Bioinformatics and computational biology are two fields that have been exploiting GPUs for more than two decades, with CUDA being the most widely used programming language for them. However, as CUDA is an NVIDIA proprietary language, it imposes a strong portability restriction across a wide range of heterogeneous architectures, such as AMD or Intel GPUs. To face […]
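
SYCL's appeal here is that one standard C++ source can target NVIDIA, AMD, and Intel devices. A minimal SYCL 2020 vector-add sketch (our illustration, not the paper's alignment kernels) shows the model:

    // vadd.cpp -- the same source compiles for NVIDIA, AMD or Intel GPUs
    // given a SYCL 2020 compiler (e.g. DPC++ or AdaptiveCpp).
    #include <sycl/sycl.hpp>
    #include <cstdio>
    #include <vector>

    int main() {
        constexpr size_t n = 1024;
        std::vector<int> a(n, 1), b(n, 2), c(n, 0);
        sycl::queue q;                                   // default device
        {
            sycl::buffer<int> A{a.data(), sycl::range<1>{n}};
            sycl::buffer<int> B{b.data(), sycl::range<1>{n}};
            sycl::buffer<int> C{c.data(), sycl::range<1>{n}};
            q.submit([&](sycl::handler& h) {
                sycl::accessor ra{A, h, sycl::read_only};
                sycl::accessor rb{B, h, sycl::read_only};
                sycl::accessor wc{C, h, sycl::write_only};
                h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
                    wc[i] = ra[i] + rb[i];               // one work-item each
                });
            });
        }                                                // buffers copy back
        std::printf("c[0] = %d\n", c[0]);                // prints 3
    }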
Feb, 25

Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures

Deep Learning (DL) frameworks such as PyTorch and TensorFlow include runtime infrastructures responsible for executing trained models on target hardware, managing memory, data transfers, and, if applicable, multi-accelerator execution. Additionally, it is common practice to deploy pre-trained models in environments distinct from their native development settings. This has led to the introduction of interchange formats […]
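
On the measurement side, Linux exposes package-level energy counters through the powercap (RAPL) sysfs interface. The sketch below assumes an Intel CPU with an intel-rapl:0 domain; it is our illustration, not the study's harness.

    // Read the cumulative package energy (microjoules) before and after a
    // workload. Real code must also handle counter wrap-around.
    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <thread>

    long long read_uj(const char* path) {
        std::ifstream f(path);
        long long uj = -1;
        f >> uj;
        return uj;
    }

    int main() {
        const char* rapl = "/sys/class/powercap/intel-rapl:0/energy_uj";
        long long before = read_uj(rapl);
        std::this_thread::sleep_for(std::chrono::seconds(1)); // run model here
        long long after = read_uj(rapl);
        if (before >= 0 && after >= before)
            std::printf("package energy: %.3f J\n", (after - before) / 1e6);
    }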
Feb, 25

Benchmarking and Dissecting the Nvidia Hopper GPU Architecture

Graphics processing units (GPUs) are continually evolving to cater to the computational demands of contemporary general-purpose workloads, particularly those driven by artificial intelligence (AI) utilizing deep learning techniques. A substantial body of studies has been dedicated to dissecting the microarchitectural metrics characterizing diverse GPU generations, which helps researchers understand the hardware details and leverage them […]
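
The dissection methodology behind such studies is microbenchmarking: timing long chains of dependent operations so that latency cannot be hidden. The host-side pointer chase below illustrates the technique (GPU papers apply the same idea inside kernels using device clock registers); it is a sketch, not the paper's harness.

    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const size_t n = 1 << 22;            // ~4M nodes, exceeds most caches
        std::vector<size_t> next(n);
        std::iota(next.begin(), next.end(), 0);
        std::mt19937_64 rng{42};
        // Sattolo's algorithm: a single-cycle permutation, so the chase
        // visits every node and cannot get stuck in a short loop.
        for (size_t k = n - 1; k > 0; --k) {
            std::uniform_int_distribution<size_t> d(0, k - 1);
            std::swap(next[k], next[d(rng)]);
        }
        size_t i = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t step = 0; step < n; ++step)
            i = next[i];                     // each load depends on the last
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
        std::printf("avg dependent-load latency: %.1f ns (i=%zu)\n", ns, i);
    }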
Feb, 18

pSTL-Bench: A Micro-Benchmark Suite for Assessing Scalability of C++ Parallel STL Implementations

Since the advent of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Given multiple existing implementations of the parallel algorithms, a systematic, quantitative performance comparison is essential for choosing the appropriate implementation for a particular hardware configuration. In this work, we introduce a […]
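
The object under test is easy to picture: the same STL algorithm call, redirected to a parallel backend by an execution policy. A minimal C++17 example (with libstdc++, link TBB, e.g. -ltbb):

    #include <algorithm>
    #include <cstdio>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> v(1 << 24, 1.5);
        // std::execution::par hands the work to the implementation's backend
        // (TBB, OpenMP, oneDPL, ...); how well each one scales is precisely
        // what a micro-benchmark suite quantifies.
        std::transform(std::execution::par, v.begin(), v.end(), v.begin(),
                       [](double x) { return x * x; });
        double sum = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
        std::printf("sum = %f\n", sum);
    }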
Feb, 18

TransAxx: Efficient Transformers with Approximate Computing

Vision Transformer (ViT) models, built on the recently introduced transformer architecture, have proven to be very competitive and have become a popular alternative to Convolutional Neural Networks (CNNs). However, the high computational requirements of these models limit their practical applicability, especially on low-power devices. The current state of the art employs approximate multipliers to address the highly increased […]
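
To make "approximate multiplier" concrete: such designs shrink the multiplier's partial-product array by ignoring low-order bits, trading accuracy for area and energy. A generic software emulation (our illustration, not the TransAxx multipliers):

    #include <cstdint>
    #include <cstdio>

    // Hypothetical approximate multiplier: keep only the top `bits` bits of
    // each 8-bit operand, as if the hardware omitted the low partial products.
    int32_t approx_mul(int8_t a, int8_t b, int bits = 6) {
        int shift = 8 - bits;
        int32_t ta = (a >> shift) * (1 << shift);   // truncate low-order bits
        int32_t tb = (b >> shift) * (1 << shift);
        return ta * tb;
    }

    int main() {
        int8_t a = 93, b = -57;
        std::printf("exact=%d approx=%d\n", a * b, approx_mul(a, b));
    }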
Feb, 18

Graphtoy: Fast Software Simulation of Applications for AMD’s AI Engines

This work presents Graphtoy, a coroutine-based compute graph simulator built in C++20, which can be embedded into a target application for rapid step-by-step prototyping of graphs targeting AMD’s AI Engines, as used in Versal FPGAs and Ryzen 7040 CPUs. By using a molecular docking application as a case study, we demonstrate: 1) how compute graphs […]
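
C++20 coroutines let each graph node be written as a resumable function that yields to a scheduler between simulation steps. The toy below captures that structure; it is our sketch, not Graphtoy's actual API.

    // Compile with -std=c++20. Two "nodes" advance in lockstep under a
    // round-robin scheduler, one cycle per resume.
    #include <coroutine>
    #include <cstdio>

    struct Step {
        struct promise_type {
            Step get_return_object() {
                return {std::coroutine_handle<promise_type>::from_promise(*this)};
            }
            std::suspend_always initial_suspend() { return {}; }
            std::suspend_always final_suspend() noexcept { return {}; }
            void return_void() {}
            void unhandled_exception() {}
        };
        std::coroutine_handle<promise_type> h;
        ~Step() { if (h) h.destroy(); }
    };

    Step kernel(const char* name) {
        for (int cycle = 0; cycle < 3; ++cycle) {
            std::printf("%s: cycle %d\n", name, cycle);
            co_await std::suspend_always{};  // hand control back to scheduler
        }
    }

    int main() {
        Step a = kernel("node-A"), b = kernel("node-B");
        while (!a.h.done() || !b.h.done()) { // round-robin scheduling
            if (!a.h.done()) a.h.resume();
            if (!b.h.done()) b.h.resume();
        }
    }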
Feb, 18

An Evaluative Comparison of Performance Portability across GPU Programming Models

Ensuring high productivity in scientific software development necessitates developing and maintaining a single codebase that can run efficiently on a range of accelerator-based supercomputing platforms. While prior work has investigated the performance portability of a few selected proxy applications or programming models, this paper provides a comprehensive study of a range of proxy applications implemented […]
Feb, 18

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

We introduce QUICK, a group of novel optimized CUDA kernels for the efficient inference of quantized Large Language Models (LLMs). QUICK addresses the shared memory bank-conflict problem of state-of-the-art mixed precision matrix multiplication kernels. Our method interleaves the quantized weight matrices of LLMs offline to skip the shared memory write-back after the dequantization. We demonstrate […]
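
The offline interleaving idea can be sketched host-side: reorder the packed weights so that values consumed together by neighboring threads land at consecutive addresses. The layout below is illustrative, not the paper's exact pattern.

    #include <cstdint>
    #include <vector>

    // Reorder a rows x cols matrix of packed quantized weights (bytes here,
    // for clarity) so each block of `lanes` rows is stored column-major;
    // assumes rows is a multiple of lanes.
    std::vector<uint8_t> interleave(const std::vector<uint8_t>& w,
                                    int rows, int cols, int lanes = 8) {
        std::vector<uint8_t> out(w.size());
        size_t idx = 0;
        for (int r0 = 0; r0 < rows; r0 += lanes)   // block of `lanes` rows
            for (int c = 0; c < cols; ++c)         // columns first...
                for (int l = 0; l < lanes; ++l)    // ...then rows in block
                    out[idx++] = w[size_t(r0 + l) * cols + c];
        return out;
    }

With this layout, the lanes of a warp fragment read one contiguous run per column instead of `lanes` strided locations, which is the kind of access pattern that avoids shared-memory bank conflicts.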

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
