Posts

Oct, 13

Optimized Code Generation for Parallel and Polyhedral Loop Nests using MLIR

In this thesis, we show the benefits of the novel MLIR compiler technology for generating code from a DSL, namely EasyML, which is used in openCARP, a widely used simulator in the cardiac electrophysiology community. Building on existing work, we deeply modified openCARP’s native code generator to enable efficient vectorized CPU and GPU code […]
Oct, 13

Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing

This book presents a comprehensive exploration of GPGPU (General Purpose Graphics Processing Unit) and its applications in deep learning and machine learning. It focuses on how parallel computing, particularly through the use of CUDA (Compute Unified Device Architecture), can unlock unprecedented computational power for complex tasks. The book provides detailed discussions on CPU and GPU […]
Oct, 13

Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs

GPUs are progressively being integrated into modern society, playing a pivotal role in Artificial Intelligence and High-Performance Computing. Programmers need a deep understanding of the GPU programming model to avoid subtle data-races in their code. Static verification that is sound and incomplete can guarantee data-race freedom, but the alarms it raises may be spurious and […]
Oct, 13

A domain-specific language for geospatial computations on the GPU

This thesis explores how a domain-specific language (DSL) for simple geospatial operators on the GPU can be developed, and evaluates the level of functionality and performance of such a DSL. The purpose of such a DSL is to simplify implementation of geospatial operators on the GPU, in order to increase productivity and performance. An embedded […]
Oct, 13

Effects of OpenCL-Based Parallelization Methods on Explicit Numerical Methods to Solve the Heat Equation

In recent years, the need for high-performance computing solutions has increased due to the growing complexity of computational tasks. The use of parallel processing techniques has become essential to address this demand. In this study, an Open Computing Language (OpenCL)-based parallelization algorithm is implemented for the Constant Neighbors (CNe) and CNe with Predictor–Corrector (CpC) numerical […]
Oct, 6

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a […]
Oct, 6

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision […]
Oct, 6

Event-Based OpenMP Tasks for Time-Sensitive GPU-Accelerated Systems

The throughput-centric design of GPUs poses challenges when integrating them into time-sensitive applications. Nevertheless, modern GPU architectures and software have recently evolved, making it possible to minimize overheads and interference along the critical path through advanced mechanisms, such as GPU graphs, while sustaining high throughput. However, GPU vendors provide programming ecosystems specific to their products, […]
Oct, 6

Benchmarking Thread Block Cluster

Graphics processing units (GPUs) have become essential accelerators in the fields of artificial intelligence (AI), high-performance computing (HPC), and data analytics, offering substantial performance improvements over traditional computing resources. In 2022, NVIDIA’s release of the Hopper architecture marked a significant advancement in GPU design by adding a new hierarchical level to their CUDA programming model: […]
Oct, 6

Intel(R) SHMEM: GPU-initiated OpenSHMEM using SYCL

Modern high-end systems are increasingly becoming heterogeneous, providing users options to use general-purpose Graphics Processing Units (GPUs) and other accelerators for additional performance. High Performance Computing (HPC) and Artificial Intelligence (AI) applications are often carefully arranged to overlap communications and computation for increased efficiency on such platforms. This has led to efforts to extend […]
Sep, 29

miniLB: A Performance Portability Study of Lattice-Boltzmann Simulations

The Lattice Boltzmann Method (LBM) is a computational technique of Computational Fluid Dynamics (CFD) that has gained popularity due to its high parallelism and ability to handle complex geometries with minimal effort. Although LBM frameworks are increasingly important in various industries and research fields, their complexity makes them difficult to modify and can lead to […]
Sep, 29

Bitstream Database-Driven FPGA Programming Flow Based on Standard OpenCL

Field-programmable gate array (FPGA) vendors provide high-level synthesis (HLS) compilers with accompanying OpenCL runtimes to enable easier use of their devices by non-hardware experts. However, the current runtimes provided by the vendors are not OpenCL-compliant, limiting the application portability and making it difficult to integrate FPGA devices in heterogeneous computing platforms. We propose an automated […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
