28667

Posts

Oct, 8

Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL library

As supercomputers become larger with powerful Graphics Processing Unit (GPU), traditional direct eigensolvers struggle to keep up with the hardware evolution and scale efficiently due to communication and synchronization demands. Conversely, subspace eigensolvers, like the Chebyshev Accelerated Subspace Eigensolver (ChASE), have a simpler structure and can overcome communication and synchronization bottlenecks. ChASE is a modern […]
Oct, 8

Automated Buffer Sizing of Dataflow Applications in a High-Level Synthesis Workflow

High-Level Synthesis (HLS) tools are mature enough to provide efficient code generation for computation kernels on FPGA hardware. For more complex applications, multiple kernels may be connected by a dataflow graph. Although some tools, such as Xilinx Vitis HLS, support dataflow directives, they lack efficient analysis methods to compute the buffer sizes between kernels in […]
Oct, 8

Impacts of Parallel Programming on Limited-Resource Hardware

Limited resource hardware devices are more affordable and energy efficient than high-end hardware. Despite their reduced size, these devices are increasingly complex, with many now featuring multiple processing cores, GPGPU accelerators, and larger RAM capacity. To fully utilize their computational capacity, software developers must exploit parallelism, but this adds an extra layer of complexity because […]
Oct, 1

Memory Efficient Mixed-Precision Optimizers

Traditional optimization methods rely on the use of single-precision floating point arithmetic, which can be costly in terms of memory size and computing power. However, mixed precision optimization techniques leverage the use of both single and half-precision floating point arithmetic to reduce memory requirements while maintaining model accuracy. We provide here an algorithm to further […]
Oct, 1

Experience Migrating OpenCL to SYCL: A Case Study on Searches for Potential Off-Target Sites of Cas9 RNA-Guided Endonucleases on AMD GPUs

Cas-OFFinder is a popular application written in OpenCL for searching potential off-target sites in parallel on a GPU. In this work, we describe our experience of migrating the application from OpenCL to SYCL. Evaluating the performance of the OpenCL and SYCL application using human genome sequences shows that the SYCL program could achieve performance portability […]
Oct, 1

OpenMP Kernel Language Extensions for Performance Portable GPU Codes

In contemporary high-performance computing architectures, the integration of GPU accelerators has become increasingly prevalent. To harness the full potential of these accelerators, developers often resort to vendor-specific kernel languages, such as CUDA. While this approach ensures optimal efficiency, it inherently compromises portability and engenders vendor dependency. Existing portable programming models, such as OpenMP, while promising, […]
Oct, 1

Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems

Deep Learning has achieved state-of-the-art performance in several artificial intelligence tasks like object recognition, speech recognition, machine translation, and summarization. Deep learning is a subset of machine learning that learns multiple levels of data representation using Neural Networks (NNs). The rise of deep learning can be attributed to the presence of large datasets and computation […]
Oct, 1

Beehive SPIR-V Toolkit: A Composable and Functional API for Runtime SPIR-V Code Generation

The Standard Portable Intermediate Representation (SPIR-V) is a low-level binary format designed for representing shaders and compute kernels that can be consumed by OpenCL for computing kernels, and Vulkan for graphics rendering. As a binary representation, SPIR-V is meant to be used by compilers and runtime systems, and is usually performed by C/C++ programs and […]
Sep, 24

Julia as a unifying end-to-end workflow language on the Frontier exascale system

We evaluate using Julia as a single language and ecosystem paradigm powered by LLVM to develop workflow components for high-performance computing. We run a Gray-Scott, 2-variable diffusion-reaction application using a memory-bound, 7-point stencil kernel on Frontier, the US Department of Energy’s first exascale supercomputer. We evaluate the feasibility, performance, scaling, and trade-offs of (i) the […]
Sep, 24

Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications

In this paper, we evaluate the portability of the SYCL programming model on some of the latest CPUs and GPUs from a wide range of vendors, utilizing the two main compilers: DPC++ and hipSYCL/OpenSYCL. Both compilers currently support GPUs from all three major vendors; we evaluate performance on the Intel(R) Data Center GPU Max 1100, […]
Sep, 24

Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs

The heterogeneous computing paradigm has led to the need for portable and efficient programming solutions that can leverage the capabilities of various hardware devices, such as NVIDIA, Intel, and AMD GPUs. This study evaluates the portability and performance of the SYCL and CUDA languages for one fundamental bioinformatics application (Smith-Waterman protein database search) across different […]
Sep, 24

Compressed Real Numbers for AI: a case-study using a RISC-V CPU

As recently demonstrated, Deep Neural Networks (DNN), usually trained using single precision IEEE 754 floating point numbers (binary32), can also work using lower precision. Therefore, 16-bit and 8-bit compressed format have attracted considerable attention. In this paper, we focused on two families of formats that have already achieved interesting results in compressing binary32 numbers in […]

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: