Posts

Jul, 2

Out-of-the-box library support for DBMS operations on GPUs

GPU-accelerated query execution is still ongoing research in the database community, as GPUs continue to be heterogeneous in their architectures and vary in their capabilities (e.g., their newest selling point: tensor cores). Hence, many researchers come up with optimal operator implementations for a specific device generation, which involves tedious operator tuning by hand. Alternatively, there is a […]
Jul, 2

SYCL compute kernels for ExaHyPE

We discuss three SYCL realisations of a simple Finite Volume scheme over multiple Cartesian patches. The realisation flavours differ in how they map the compute steps onto loops and tasks: We compare an implementation which exclusively uses a cascade of for-loops to a version which uses nested parallelism, and finally benchmark these […]
Jul, 2

Managing, Profiling, and Optimizing Heterogeneous GPU Workloads

The popularity of machine learning (ML) workloads has made GPU instance offerings ubiquitous in the cloud, introducing new challenges in managing, profiling, and optimizing GPU workloads. Cloud providers assign passthrough GPUs directly to virtual machines (VMs) for high performance, but doing so renders VM migration non-functional, limiting cloud operators' ability to manage hardware resources. Existing […]
Jun, 25

Deep Language Models for Software Testing and Optimisation

Developing software is difficult. A challenging part of production development is ensuring programs are correct and fast, two properties satisfied with software testing and optimisation. While both tasks still rely on manual effort and expertise, the recent surge in software applications has led them to become tedious and time-consuming. Under this fast-paced environment, manual testing […]
Jun, 25

Compilation and Design Space Exploration of Dataflow Programs for Heterogeneous CPU-GPU Platforms

Today’s continued increase in demand for processing power, despite the slowdown of Moore’s law, has led to an increase in processor count, which has resulted in energy consumption and distribution problems. To address this, there is a growing trend toward creating more complex heterogeneous systems where multicore, many-core, GPU, FPGA, and DSPs are combined in […]
Jun, 25

DGEMM on Integer Matrix Multiplication Unit

Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point value computation is commonplace, where the input and output values and the model parameters are quantized. Thus, many processors are now equipped with fast integer matrix multiplication units (IMMU). It is […]
Jun, 25

GPU First – Execution of Legacy CPU Codes on GPUs

Utilizing GPUs is critical for high performance on heterogeneous systems. However, leveraging the full potential of GPUs for accelerating legacy CPU applications can be a challenging task for developers. The porting process requires identifying code regions amenable to acceleration, managing distinct memories, synchronizing host and device execution, and handling library functions that may not be […]
Jun, 25

ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code

Automatic code optimization is a complex process that typically involves the application of multiple discrete algorithms that modify the program structure irreversibly. However, the design of these algorithms is often monolithic, and they require repetitive implementation to perform similar analyses due to the lack of cooperation. To address this issue, modern optimization techniques, such as […]
Jun, 18

Reducing branch divergence to speed up parallel execution of unit testing on GPUs

Software testing is an essential phase in the software development life cycle. One of the important types of software testing is unit testing and its execution is time-consuming and costly. Using parallelization to speed up the testing execution is beneficial and productive for programmers. To parallelize test execution, researchers can use GPU machines. In GPU […]
Jun, 18

Improving Performance of Iterative Applications through Interleaved Execution of Approximated CUDA Kernels

Approximate computing techniques, particularly those involving reduced and mixed precision, are widely studied in the literature to accelerate applications and reduce energy consumption. Although many researchers analyze the performance, accuracy loss, and energy consumption of a wide range of application domains, few evaluate approximate computing techniques in iterative applications. These applications rely on the result of […]
Jun, 18

Efficient GPU implementation of a class of array permutations

Optimal usage of the memory system is a key element of fast GPU algorithms. Unfortunately, many common algorithms fail in this regard despite exhibiting great regularity in memory access patterns. In this paper we propose efficient kernels to permute the elements of an array, which can be used to improve the access patterns of many […]
Jun, 18

cuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications

CUDA, OpenCL, and OpenACC are the primary means of writing general-purpose software for NVIDIA GPUs, all of which are subject to the same well-documented memory safety vulnerabilities currently plaguing software written in C and C++. One can argue that the GPU execution environment makes software development more error prone. Unlike C and C++, CUDA features […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
