28373

Posts

Jun, 25

GPU First – Execution of Legacy CPU Codes on GPUs

Utilizing GPUs is critical for high performance on heterogeneous systems. However, leveraging the full potential of GPUs for accelerating legacy CPU applications can be a challenging task for developers. The porting process requires identifying code regions amenable to acceleration, managing distinct memories, synchronizing host and device execution, and handling library functions that may not be […]
Jun, 25

ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code

Automatic code optimization is a complex process that typically involves the application of multiple discrete algorithms that modify the program structure irreversibly. However, the design of these algorithms is often monolithic, and they require repetitive implementation to perform similar analyses due to the lack of cooperation. To address this issue, modern optimization techniques, such as […]
Jun, 18

Improving Performance of Iterative Applications through Interleaved Execution of Approximated CUDA Kernels

Approximate computing techniques, particularly those involving reduced and mixed precision, are widely studied in literature to accelerate applications and reduce energy consumption. Although many researchers analyze the performance, accuracy loss, and energy consumption of a wide range of application domains, few evaluate approximate computing techniques in iterative applications. These applications rely on the result of […]
Jun, 18

Reducing branch divergence to speed up parallel execution of unit testing on GPUs

Software testing is an essential phase in the software development life cycle. One of the important types of software testing is unit testing and its execution is time-consuming and costly. Using parallelization to speed up the testing execution is beneficial and productive for programmers. To parallelize test execution, researchers can use GPU machines. In GPU […]
Jun, 18

Efficient GPU implementation of a class of array permutations

Optimal usage of the memory system is a key element of fast GPU algorithms. Unfortunately many common algorithms fail in this regard despite exhibiting great regularity in memory access patterns. In this paper we propose efficient kernels to permute the elements of an array, which can be used to improve the access patterns of many […]
Jun, 18

cuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications

CUDA, OpenCL, and OpenACC are the primary means of writing general-purpose software for NVIDIA GPUs, all of which are subject to the same well-documented memory safety vulnerabilities currently plaguing software written in C and C++. One can argue that the GPU execution environment makes software development more error prone. Unlike C and C++, CUDA features […]
Jun, 18

EfficientBioAI: Making Bioimaging AI Models Efficient in Energy, Latency and Representation

Artificial intelligence (AI) has been widely used in bioimage image analysis nowadays, but the efficiency of AI models, like the energy consumption and latency is not ignorable due to the growing model size and complexity, as well as the fast-growing analysis needs in modern biomedical studies. Like we can compress large images for efficient storage […]
Jun, 11

GPUHarbor: Testing GPU Memory Consistency at Large

Memory consistency specifications (MCSs) are a difficult, yet critical, part of a concurrent programming framework. Existing MCS testing tools are not immediately accessible, and thus, they have only been applied to a limited number of platforms. However, in the post-Dennard scaling landscape, there has been an explosion of new architectures and frameworks, especially for GPUs. […]
Jun, 11

Program Analysis and Machine Learning based Approach to Predict Power Consumption of CUDA Kernel

General Purpose Graphics Processing Unit (GPGPU) has secured a prominent position in the High-Performance Computing (HPC) world due to its performance gain and programmability. Understanding the relationship between GPU power consumption and program features can aid developers in building energy-efficient sustainable applications. In this work, we propose a static analysis based power model built using […]
Jun, 11

SIMULATeQCD: A simple multi-GPU lattice code for QCD calculations

The rise of exascale supercomputers has fueled competition among GPU vendors, driving lattice QCD developers to write code that supports multiple APIs. Moreover, new developments in algorithms and physics research require frequent updates to existing software. These challenges have to be balanced against constantly changing personnel. At the same time, there is a wide range […]
Jun, 11

minimap2-fpga: Integrating hardware-accelerated chaining for efficient end-to-end long-read sequence mapping

minimap2 is the gold-standard software for reference-based sequence mapping in third-generation long-read sequencing. While minimap2 is relatively fast, further speedup is desirable, especially when processing a multitude of large datasets. In this work, we present minimap2-fpga, a hardware-accelerated version of minimap2 that speeds up the mapping process by integrating an FPGA kernel optimised for chaining. […]
Jun, 11

Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs

General Matrix Multiplication (GEMM) is a fundamental operation widely used in scientific computations. Its performance and accuracy significantly impact the performance and accuracy of applications that depend on it. One such application is semidefinite programming (SDP), and it often requires binary128 or higher precision arithmetic to solve problems involving SDP stably. However, only some processors […]

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: