Posts
Oct, 22
LeXInt: GPU-accelerated Exponential Integrators package
We present an open-source CUDA-based package that consists of a compilation of exponential integrators where the action of the matrix exponential or the φl functions on a vector is approximated using the method of polynomial interpolation at Leja points. Using a couple of test examples on an NVIDIA A100 GPU, we show that one can […]
Oct, 15
OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials
Machine learning plays an important and growing role in molecular simulation. The newest version of the OpenMM molecular dynamics toolkit introduces new features to support the use of machine learning potentials. Arbitrary PyTorch models can be added to a simulation and used to compute forces and energy. A higher-level interface allows users to easily model […]
Oct, 15
Reverse-Mode AD of Reduce-by-Index and Scan in Futhark
We present and evaluate the Futhark implementation of reverse-mode automatic differentiation (AD) for the basic blocks of parallel programming: reduce, prefix sum (scan), and reduce by index. We first present derivations of general-case algorithms and then discuss several specializations that result in efficient differentiation of most cases of practical interest. We report an experiment that […]
Oct, 15
Strega: An HTTP Server for FPGAs
The computer architecture landscape is being reshaped by the new opportunities, challenges and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to […]
Oct, 15
Comparison of different n-body algorithms on various hardware platforms using SYCL
The n-body problem has various applications in different fields of science such as astrophysics, where it describes the problem of calculating the movements of n different bodies which all interact with each other over time. There exist different algorithms that solve the n-body problem, for example, the naive approach and the Barnes-Hut algorithm. Since applications […]
Oct, 15
Open SYCL on heterogeneous GPU systems: A case of study
Computational platforms for high-performance scientific applications are becoming more heterogenous, including hardware accelerators such as multiple GPUs. Applications in a wide variety of scientific fields require an efficient and careful management of the computational resources of this type of hardware to obtain the best possible performance. However, there are currently different GPU vendors, architectures and […]
Oct, 8
Exploring the Limits of Generic Code Execution on GPUs via Direct (OpenMP) Offload
GPUs are well-known for their remarkable ability to accelerate computations through massive parallelism. However, offloading computations to GPUs necessitates manual identification of code regions that should be executed on the device, memory that needs to be transferred, and synchronization to be handled. Recent work has leveraged the portable target offloading interface provided by LLVM/OpenMP, taking […]
Oct, 8
Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL library
As supercomputers become larger with powerful Graphics Processing Unit (GPU), traditional direct eigensolvers struggle to keep up with the hardware evolution and scale efficiently due to communication and synchronization demands. Conversely, subspace eigensolvers, like the Chebyshev Accelerated Subspace Eigensolver (ChASE), have a simpler structure and can overcome communication and synchronization bottlenecks. ChASE is a modern […]
Oct, 8
Automated Buffer Sizing of Dataflow Applications in a High-Level Synthesis Workflow
High-Level Synthesis (HLS) tools are mature enough to provide efficient code generation for computation kernels on FPGA hardware. For more complex applications, multiple kernels may be connected by a dataflow graph. Although some tools, such as Xilinx Vitis HLS, support dataflow directives, they lack efficient analysis methods to compute the buffer sizes between kernels in […]
Oct, 8
Impacts of Parallel Programming on Limited-Resource Hardware
Limited resource hardware devices are more affordable and energy efficient than high-end hardware. Despite their reduced size, these devices are increasingly complex, with many now featuring multiple processing cores, GPGPU accelerators, and larger RAM capacity. To fully utilize their computational capacity, software developers must exploit parallelism, but this adds an extra layer of complexity because […]
Oct, 8
Fortran performance optimisation and auto-parallelisation by leveraging MLIR-based domain specific abstractions in Flang
MLIR has become popular since it was open sourced in 2019. A sub-project of LLVM, the flexibility provided by MLIR to represent Intermediate Representations (IR) as dialects at different abstraction levels, to mix these, and to leverage transformations between dialects provides opportunities for automated program optimisation and parallelisation. In addition to general purpose compilers built […]
Oct, 1
Memory Efficient Mixed-Precision Optimizers
Traditional optimization methods rely on the use of single-precision floating point arithmetic, which can be costly in terms of memory size and computing power. However, mixed precision optimization techniques leverage the use of both single and half-precision floating point arithmetic to reduce memory requirements while maintaining model accuracy. We provide here an algorithm to further […]