high performance computing on graphics processing units: hgpu.org

Posts

Feb, 2

GPU-accelerated dynamic programming for join-order optimization

Relational databases need to select efficient join orders, as inefficient join orders can increase the query execution time by several orders of magnitude. To select efficient join orders, relational databases can apply an exhaustive search using dynamic programming. Unfortunately, the applicability of sequential dynamic programming variants is limited to simple queries due to the exhaustive […]

OpenCL

Feb, 2

Non-Determinism in TensorFlow ResNets

We show that the stochasticity in training ResNets for image classification on GPUs in TensorFlow is dominated by the non-determinism from GPUs, rather than by the initialisation of the weights and biases of the network or by the sequence of minibatches given. The standard deviation of test set accuracy is 0.02 with fixed seeds, compared […]

Feb, 2

Optimization of a discontinuous Galerkin solver with OpenCL and StarPU

Since the recent advance in microprocessor design, the optimization of computing software becomes more and more technical. One of the difficulties is to transform sequential algorithms into parallel ones. A possible solution is the task-based design. In this approach, it is possible to describe the parallelization possibilities of the algorithm automatically. The task-based design is […]

OpenCL

Feb, 2

Noise Removal from Remote Sensed Images by NonLocal Means with OpenCL Algorithm

We introduce a multi-platform portable implementation of the NonLocal Means methodology aimed at noise removal from remotely sensed images. It is particularly suited for hyperspectral sensors for which real-time applications are not possible with only CPU based algorithms. In the last decades computational devices have usually been a compound of cross-vendor sets of specifications (heterogeneous […]

OpenCL

Feb, 2

Interoperable GPU Kernels as Latency Improver for MEC

Mixed reality (MR) applications are expected to become common when 5G goes mainstream. However, the latency requirements are challenging to meet due to the resources required by video-based remoting of graphics, that is, decoding video codecs. We propose an approach towards tackling this challenge: a client-server implementation for transacting intermediate representation (IR) between a mobile […]

Jan, 26

Using Parallel Programming Models for Automotive Workloads on Heterogeneous Systems – a Case Study

Due to the ever-increasing computational demand of automotive applications, and in particular autonomous driving functionalities, the automotive industry and supply vendors are starting to adopt parallel and heterogeneous embedded platforms for their products. However, C and C++, the currently dominating programming languages in this industry, do not provide sufficient mechanisms to target such platforms. Established […]

CUDA

•

OpenCL

Jan, 26

Hardware/Software Co-Design for Data-Intensive Genomics Workloads

Since the last decade, the main components of computer systems have been evolving, diversifying, to overcome their physical limits and to minimize their energy footprint. Hardware specialization and heterogeneity have become key to design more efficient systems and tackle ever-important problems with ever-larger volumes of data. However, to fully take advantage of the new hardware, […]

OpenCL

Jan, 26

Automatically Harnessing Sparse Acceleration

Sparse linear algebra is central to many scientific programs, yet compilers fail to optimize it well. High-performance libraries are available, but adoption costs are significant. Moreover, libraries tie programs into vendor-specific software and hardware ecosystems, creating non-portable code. In this paper, we develop a new approach based on our specification Language for implementers of Linear […]

CUDA

•

OpenCL

Jan, 26

Efficient Radial Pattern Keyword Search on Knowledge Graphs in Parallel

Recently, keyword search on Knowledge Graphs (KGs) becomes popular. Typical keyword search approaches aim at finding a concise subgraph from a KG, which can reflect a close relationship among all input keywords. The connection paths between keywords are selected in a way that leads to a result subgraph with a better semantic score. However, such […]

CUDA

Jan, 26

An Image Enhancing Pattern-based Sparsity for Real-time Inference on Mobile Devices

Weight pruning has been widely acknowledged as a straightforward and effective method to eliminate redundancy in Deep Neural Networks (DNN), thereby achieving acceleration on various platforms. However, most of the pruning techniques are essentially trade-offs between model accuracy and regularity which lead to impaired inference accuracy and limited on-device acceleration performance. To solve the problem, […]

CUDA

•

OpenCL

Jan, 19

On Demand Solid Texture Synthesis Using Deep 3D Networks

This paper describes a novel approach for on demand volumetric texture synthesis based on a deep learning framework that allows for the generation of high quality 3D data at interactive rates. Based on a few example images of textures, a generative network is trained to synthesize coherent portions of solid textures of arbitrary sizes that […]

OpenGL

Jan, 19

md_poly: A Performance-Portable Polyhedral Compiler Based on Multi-Dimensional Homomorphisms

Polyhedral compilers automatically parallelize sequential programs for multi- and many-core architectures, such as CPU and GPU. However, parallel code generated by state-ofthe-art polyhedral compilers often lacks performance portability, because the existing compilers are usually optimized toward only a single particular architecture (e.g., GPU). Moreover, even on their target architecture, polyhedral compilers sometimes fail to reach […]

OpenCL

high performance computing on graphics processing units: hgpu.org

Posts

GPU-accelerated dynamic programming for join-order optimization

Non-Determinism in TensorFlow ResNets

Optimization of a discontinuous Galerkin solver with OpenCL and StarPU

Noise Removal from Remote Sensed Images by NonLocal Means with OpenCL Algorithm

Interoperable GPU Kernels as Latency Improver for MEC

Using Parallel Programming Models for Automotive Workloads on Heterogeneous Systems – a Case Study

Hardware/Software Co-Design for Data-Intensive Genomics Workloads

Automatically Harnessing Sparse Acceleration

Efficient Radial Pattern Keyword Search on Knowledge Graphs in Parallel

An Image Enhancing Pattern-based Sparsity for Real-time Inference on Mobile Devices

On Demand Solid Texture Synthesis Using Deep 3D Networks

md_poly: A Performance-Portable Polyhedral Compiler Based on Multi-Dimensional Homomorphisms

Recent source codes

tritonBLAS: A Lightweight Triton-based General Matrix Multiplication (GEMM) Library

hls4ml: Machine learning on FPGAs using HLS

ThunderKittens: Tile primitives for speedy kernels

NVIDIA Nemotron Parse 1.1

Iris: AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

HipKittens: Fast and Furious AMD Kernels

Fortran xDSL dialects

mt4g: Memory Topology 4 GPUs

Falcon: GPU-Based Floating-point Adaptive Lossless Compression

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Most viewed papers (last 30 days)