Posts
Oct 13
Effects of OpenCL-Based Parallelization Methods on Explicit Numerical Methods to Solve the Heat Equation
In recent years, the need for high-performance computing solutions has increased due to the growing complexity of computational tasks. The use of parallel processing techniques has become essential to address this demand. In this study, an Open Computing Language (OpenCL)-based parallelization algorithm is implemented for the Constant Neighbors (CNe) and CNe with Predictor–Corrector (CpC) numerical […]
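The CNe and CpC schemes are the paper's own constructions, but the underlying idea of an explicit time-stepping method for the heat equation can be sketched with the standard textbook FTCS (forward-time, centered-space) update; this is a generic illustration, not the paper's algorithm, and `heat_step` is a hypothetical helper name:

```python
# Generic explicit (FTCS) update for the 1D heat equation u_t = alpha * u_xx.
# Not the paper's CNe/CpC method -- just the classic explicit scheme that
# such methods build on. Each interior point is updated from its neighbors.

def heat_step(u, r):
    """One explicit time step on a rod with fixed (Dirichlet) boundaries.

    r = alpha * dt / dx**2 is the mesh ratio; the classic stability
    condition for this scheme is r <= 0.5.
    """
    return [u[0]] + [
        u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
        for i in range(1, len(u) - 1)
    ] + [u[-1]]

# Example: a single hot spot diffusing outward over ten steps.
u = [0.0, 0.0, 100.0, 0.0, 0.0]
for _ in range(10):
    u = heat_step(u, 0.25)
```

Because every point's update reads only a fixed stencil of neighbors, each time step is embarrassingly parallel, which is what makes such schemes natural targets for OpenCL parallelization.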
Oct 6
Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric
Modern GPU systems are constantly evolving to meet the needs of compute-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a […]
Oct 6
Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision […]
Oct 6
Event-Based OpenMP Tasks for Time-Sensitive GPU-Accelerated Systems
The throughput-centric design of GPUs poses challenges when integrating them into time-sensitive applications. Nevertheless, modern GPU architectures and software have recently evolved, making it possible to minimize overheads and interference along the critical path through advanced mechanisms, such as GPU graphs, while sustaining high throughput. However, GPU vendors provide programming ecosystems specific to their products, […]
Oct 6
Benchmarking Thread Block Cluster
Graphics processing units (GPUs) have become essential accelerators in the fields of artificial intelligence (AI), high-performance computing (HPC), and data analytics, offering substantial performance improvements over traditional computing resources. In 2022, NVIDIA’s release of the Hopper architecture marked a significant advancement in GPU design by adding a new hierarchical level to their CUDA programming model: […]
Oct 6
Intel(R) SHMEM: GPU-initiated OpenSHMEM using SYCL
Modern high-end systems are increasingly becoming heterogeneous, providing users options to use general-purpose Graphics Processing Units (GPUs) and other accelerators for additional performance. High Performance Computing (HPC) and Artificial Intelligence (AI) applications are often carefully arranged to overlap communication and computation for increased efficiency on such platforms. This has led to efforts to extend […]
Sep 29
Automatic Generation of OpenCL Code through Polyhedral Compilation with LLM
In recent years, a multitude of AI solutions based on large language models (LLMs) has emerged to facilitate code generation. These tools empower programmers to automate their work. Automatic programming also falls within the domain of optimizing compilers, primarily based on the polyhedral model, which targets the loop nests that concentrate most of a program's computation. This article focuses […]
Sep 29
HPC acceleration of large (min, +) matrix products to compute domination-type parameters in graphs
Computing domination-type parameters in Cartesian product graphs is a challenging problem. We present an algorithmic method to compute the 2-domination number of the Cartesian product of a path of small order and any cycle, involving the (min,+) matrix product. We establish some theoretical results that provide the algorithms necessary to compute that […]
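The (min,+) product mentioned above is the standard tropical-algebra operation: entry (i, j) of the result is the minimum over k of A[i][k] + B[k][j]. A minimal sketch of the definition (the `minplus` helper is illustrative, not the paper's HPC kernel):

```python
import math

# (min,+) matrix product: C[i][j] = min over k of A[i][k] + B[k][j].
# math.inf plays the role of "no entry" / "no path".

def minplus(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [
        [min(A[i][k] + B[k][j] for k in range(m)) for j in range(p)]
        for i in range(n)
    ]

# With A a weighted adjacency matrix, repeated (min,+) multiplication
# accumulates shortest walk costs, which is why these products appear in
# dynamic-programming computations over product graphs.
A = [[0,        1,        math.inf],
     [math.inf, 0,        2       ],
     [math.inf, math.inf, 0       ]]
D = minplus(A, A)  # two-hop costs: D[0][2] combines the edges 0->1 and 1->2
```

The triple loop has the same O(n^3) shape as an ordinary matrix product, with (min, +) replacing (+, x), so the same GPU blocking and tiling strategies apply.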
Sep 29
miniLB: A Performance Portability Study of Lattice-Boltzmann Simulations
The Lattice Boltzmann Method (LBM) is a Computational Fluid Dynamics (CFD) technique that has gained popularity due to its high parallelism and ability to handle complex geometries with minimal effort. Although LBM frameworks are increasingly important in various industries and research fields, their complexity makes them difficult to modify and can lead to […]
Sep 29
Bitstream Database-Driven FPGA Programming Flow Based on Standard OpenCL
Field-programmable gate array (FPGA) vendors provide high-level synthesis (HLS) compilers with accompanying OpenCL runtimes to enable easier use of their devices by non-hardware experts. However, the current runtimes provided by the vendors are not OpenCL-compliant, limiting the application portability and making it difficult to integrate FPGA devices in heterogeneous computing platforms. We propose an automated […]
Sep 29
OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs
GPUs are the heart of the latest generations of supercomputers. We efficiently accelerate a compressible multiphase flow solver via OpenACC on NVIDIA and AMD Instinct GPUs. Optimization is accomplished by specifying the directive clauses ‘gang vector’ and ‘collapse’. Further speedups of six and ten times are achieved by packing user-defined types into coalesced multidimensional arrays […]
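The `collapse` clause fuses a nest of loops into one flat iteration space so the GPU can distribute all iterations across gangs and vector lanes, rather than only the outer loop's. A language-agnostic sketch of that index mapping (in Python for illustration rather than the OpenACC C/Fortran it applies to; this is not the MFC solver's code):

```python
# What an OpenACC 'collapse(2)' does conceptually: turn a 2-deep loop nest
# over (i, j) into a single loop over NI*NJ flat indices, exposing all
# NI*NJ iterations for parallel distribution instead of only the outer NI.

def collapsed_indices(ni, nj):
    for flat in range(ni * nj):
        i, j = divmod(flat, nj)   # recover the original (i, j) pair
        yield i, j

# The flattened order visits exactly the same (i, j) pairs as the nest:
NI, NJ = 4, 3
nested = [(i, j) for i in range(NI) for j in range(NJ)]
assert list(collapsed_indices(NI, NJ)) == nested
```

On small outer trip counts (common in structured multiphase solvers), this flattening is what allows the iteration space to be large enough to saturate a GPU.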
Sep 22
The Landscape of GPU-Centric Communication
In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their massive parallelism and high memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now […]