Posts
Oct, 13
A domain-specific language for geospatial computations on the GPU
This thesis explores how a domain-specific language (DSL) for simple geospatial operators on the GPU can be developed, and evaluates the level of functionality and performance of such a DSL. The purpose of such a DSL is to simplify implementation of geospatial operators on the GPU, in order to increase productivity and performance. An embedded […]
Oct, 6
Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric
Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a […]
Oct, 6
Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision […]
Oct, 6
Event-Based OpenMP Tasks for Time-Sensitive GPU-Accelerated Systems
The throughput-centric design of GPUs poses challenges when integrating them into time-sensitive applications. Nevertheless, modern GPU architectures and software have recently evolved, making it possible to minimize overheads and interference along the critical path through advanced mechanisms, such as GPU graphs, while sustaining high throughput. However, GPU vendors provide programming ecosystems specific to their products, […]
Oct, 6
Benchmarking Thread Block Cluster
Graphics processing units (GPUs) have become essential accelerators in the fields of artificial intelligence (AI), high-performance computing (HPC), and data analytics, offering substantial performance improvements over traditional computing resources. In 2022, NVIDIA’s release of the Hopper architecture marked a significant advancement in GPU design by adding a new hierarchical level to its CUDA programming model: […]
Oct, 6
Intel(R) SHMEM: GPU-initiated OpenSHMEM using SYCL
Modern high-end systems are increasingly heterogeneous, giving users the option of using general-purpose Graphics Processing Units (GPUs) and other accelerators for additional performance. High Performance Computing (HPC) and Artificial Intelligence (AI) applications are often carefully arranged to overlap communications and computation for increased efficiency on such platforms. This has led to efforts to extend […]
Sep, 29
HPC acceleration of large (min, +) matrix products to compute domination-type parameters in graphs
The computation of the domination-type parameters is a challenging problem in Cartesian product graphs. We present an algorithmic method to compute the 2-domination number of the Cartesian product of a path with small order and any cycle, involving the (min,+) matrix product. We establish some theoretical results that provide the algorithms necessary to compute that […]
Sep, 29
miniLB: A Performance Portability Study of Lattice-Boltzmann Simulations
The Lattice Boltzmann Method (LBM) is a computational technique of Computational Fluid Dynamics (CFD) that has gained popularity due to its high parallelism and ability to handle complex geometries with minimal effort. Although LBM frameworks are increasingly important in various industries and research fields, their complexity makes them difficult to modify and can lead to […]
Sep, 29
Bitstream Database-Driven FPGA Programming Flow Based on Standard OpenCL
Field-programmable gate array (FPGA) vendors provide high-level synthesis (HLS) compilers with accompanying OpenCL runtimes to enable easier use of their devices by non-hardware experts. However, the current runtimes provided by the vendors are not OpenCL-compliant, limiting the application portability and making it difficult to integrate FPGA devices in heterogeneous computing platforms. We propose an automated […]
Sep, 29
Automatic Generation of OpenCL Code through Polyhedral Compilation with LLM
In recent years, a multitude of AI solutions has emerged to facilitate code generation, commonly known as Language Model-based Programming (LLM). These tools empower programmers to automate their work. Automatic programming also falls within the domain of optimizing compilers, primarily based on the polyhedral model, which processes loop nests concentrating most computations. This article focuses […]
Sep, 29
OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs
GPUs are the heart of the latest generations of supercomputers. We efficiently accelerate a compressible multiphase flow solver via OpenACC on NVIDIA and AMD Instinct GPUs. Optimization is accomplished by specifying the directive clauses ‘gang vector’ and ‘collapse’. Further speedups of six and ten times are achieved by packing user-defined types into coalesced multidimensional arrays […]
Sep, 22
Collection skeletons: declarative abstractions for data collections
Modern programming languages provide programmers with rich abstractions for data collections as part of their standard libraries, e.g., Containers in the C++ STL, the Java Collections Framework, or the Scala Collections API. Typically, these collections frameworks are organised as hierarchies that provide programmers with common abstract data types (ADTs) like lists, queues, and stacks. While […]