Posts
May, 14
TorchBench: Benchmarking PyTorch with High API Surface Coverage
Deep learning (DL) has been a revolutionary technique in various domains. To facilitate the model development and deployment, many deep learning frameworks are proposed, among which PyTorch is one of the most popular solutions. The performance of ecosystem around PyTorch is critically important, which saves the costs of training models and reduces the response time […]
May, 14
Performance Optimization using Multimodal Modeling and Heterogeneous GNN
Growing heterogeneity and configurability in HPC architectures has made auto-tuning applications and runtime parameters on these systems very complex. Users are presented with a multitude of options to configure parameters. In addition to application specific solutions, a common approach is to use general purpose search strategies, which often might not identify the best configurations or […]
May, 14
Descend: A Safe GPU Systems Programming Language
Graphics Processing Units (GPU) offer tremendous computational power by following a throughput oriented computing paradigm where many thousand computational units operate in parallel. Programming this massively parallel hardware is challenging. Programmers must correctly and efficiently coordinate thousands of threads and their accesses to various shared memory spaces. Existing mainstream GPU programming languages, such as CUDA […]
May, 14
Prediction of Performance and Power Consumption of GPGPU Applications
Graphics Processing Units (GPUs) have become an integral part of High-Performance Computing to achieve an Exascale performance. The main goal of application developers of GPU is to tune their code extensively to obtain optimal performance, making efficient use of different resources available. While extracting optimal performance of applications on an HPC infrastructure, developers should also […]
May, 7
Dynamically Finding Optimal Kernel Launch Parameters for CUDA Programs
In this thesis, we present KLARAPTOR (Kernel LAunch parameters RAtional Program estimaTOR), a freely available tool to dynamically determine the values of kernel launch parameters of a CUDA kernel. We describe a technique for building a helper program, at the compile-time of a CUDA program, that is used at run-time to determine near-optimal kernel launch […]
May, 7
Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads
Shared memory heterogeneous systems are now mainstream, with nearly every mobile phone and tablet containing integrated processing units. However, developing applications for such devices is difficult as workloads must be decomposed across different processing units, and the decomposition must be flexible to account for the growing diversity of devices, each with different relative processing unit […]
May, 7
Optimizing Deep Learning Models For Raspberry Pi
Deep learning models have become increasingly popular for a wide range of applications, including computer vision, natural language processing, and speech recognition. However, these models typically require large amounts of computational resources, making them challenging to run on low-power devices such as the Raspberry Pi. One approach to addressing this challenge is to use pruning […]
May, 7
Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs
General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as machine learning and scientific computing, and an efficient GEMM implementation is essential for the performance of these systems. While researchers often strive for faster performance by using large compute platforms, the increased scale of these systems can raise concerns about hardware and […]
May, 7
FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs
Today’s large-scale scientific applications running on high-performance computing (HPC) systems generate vast data volumes. Thus, data compression is becoming a critical technique to mitigate the storage burden and data-movement cost. However, existing lossy compressors for scientific data cannot achieve a high compression ratio and throughput simultaneously, hindering their adoption in many applications requiring fast compression, […]
Apr, 23
Simple and efficient GPU accelerated topology optimisation: Codes and applications
This work presents topology optimisation implementations for linear elastic compliance minimisation in three dimensions, accelerated using Graphics Processing Units (GPUs). Three different open-source implementations are presented for linear problems. Two implementations use GPU acceleration, based on either OpenMP 4.5 or the Futhark language to implement the hardware acceleration. Both GPU implementations are based on high […]
Apr, 23
Thread-safe lattice Boltzmann for high-performance computing on GPUs
We present thread-safe, highly-optimized lattice Boltzmann implementations, specifically aimed at exploiting the high memory bandwidth of GPU-based architectures. At variance with standard approaches to LB coding, the proposed strategy, based on the reconstruction of the post-collision distribution via Hermite projection, enforces data locality and avoids the onset of memory dependencies, which may arise during the […]
Apr, 23
Fuzzing Loop Optimizations in Compilers for C++ and Data-Parallel Languages
Compilers are part of the foundation upon which software systems are built; they need to be as correct as possible. This paper is about stress-testing loop optimizers; it presents a major reimplementation of Yet Another Random Program Generator (YARPGen), an open-source generative compiler fuzzer. This new version has found 122 bugs, both in compilers for […]