high performance computing on graphics processing units: hgpu.org

Posts

Feb, 21

High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures

Compressive sensing (CS) describes how sparse signals can be accurately reconstructed from far fewer samples than required by the Nyquist criterion. Since MRI scan duration is proportional to the number of acquired samples, CS has been gaining significant attention in MRI. However, the computationally intensive nature of CS reconstructions has precluded their use in routine […]

CUDA

Feb, 21

Acceleration of Composite Order Bilinear Pairing on Graphics Hardware

Recently, composite-order bilinear pairing has been shown to be useful in many cryptographic constructions. However, it is costly to evaluate: the composite order should be at least 1024 bits, so the elliptic curve group order $n$ and the base field become too large, rendering the bilinear pairing algorithm itself too slow to be […]

CUDA

Feb, 21

GPGPU Processing in CUDA Architecture

The future of computation is the Graphics Processing Unit, i.e. the GPU. Given the promise graphics cards have shown in image processing and accelerated rendering of 3D scenes, and the computational capability they possess, GPUs are developing into powerful parallel computing units. It is quite simple to program a […]

CUDA • OpenCL

Feb, 20

Implementation of LTE Mini receiver on GPUs

Long Term Evolution (LTE) is the latest standard for cellular mobile communication. To fully exploit the available spectrum, LTE utilizes feedback. Since the radio channel is varying in time, the feedback calculation is latency sensitive. In our upcoming LTE measurement with the Vienna Multiple Input Multiple Output (MIMO) Testbed, a low latency feedback calculation is […]

CUDA

Feb, 20

Model-Driven Tile Size Selection for DOACROSS Loops on GPUs

DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. In contrast, due to loop-carried dependences, DOACROSS loops must be skewed first in order to make tiling legal and exploit wavefront parallelism across the tiles and within a tile. Thus, tile size selection, which is performance-critical, becomes more complex for DOACROSS loops […]

CUDA
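The skew-then-tile idea in this abstract can be illustrated with a small sequential sketch. The recurrence `A[i][j] = A[i][j-1] + A[i-1][j+1] + 1` is a hypothetical example, not taken from the paper: its dependence vectors (0, 1) and (1, -1) make rectangular tiling of the original (i, j) space illegal. Skewing with `jj = i + j` turns them into (0, 1) and (1, 0), after which square tiles of size `T` can be executed in wavefront order (all tiles with the same `bi + bj` are independent). The tile size `T` is a free parameter here; choosing it well is what the paper's model addresses, and nothing below reflects that model.

```python
# Hypothetical DOACROSS loop nest (not from the paper):
#   A[i][j] = A[i][j-1] + A[i-1][j+1] + 1
# Dependence vectors (0, 1) and (1, -1): neither loop is DOALL.


def reference(n, m):
    """Original loop order; row 0 and columns 0 and m+1 are zero boundaries."""
    A = [[0] * (m + 2) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = A[i][j - 1] + A[i - 1][j + 1] + 1
    return A


def skewed_tiled(n, m, T):
    """Skew j by i (jj = i + j), tile (i, jj) into T x T tiles, run tiles by wavefront.

    After skewing, the dependences become (0, 1) and (1, 0), so tile (bi, bj)
    only needs tiles (bi - 1, bj) and (bi, bj - 1), i.e. the previous wavefront.
    """
    A = [[0] * (m + 2) for _ in range(n + 1)]
    nbi = -(-n // T)                # tiles along i (ceiling division)
    nbj = -(-(n + m - 1) // T)      # tiles along jj, which spans 2 .. n + m
    for wave in range(nbi + nbj - 1):
        # Tiles on one anti-diagonal are independent: a GPU could run them
        # concurrently (e.g. one thread block per tile); here they run in a loop.
        for bi in range(max(0, wave - nbj + 1), min(nbi, wave + 1)):
            bj = wave - bi
            for i in range(bi * T + 1, min(n, (bi + 1) * T) + 1):
                for jj in range(bj * T + 2, min(n + m, (bj + 1) * T + 1) + 1):
                    j = jj - i      # un-skew to recover the original column
                    if 1 <= j <= m:  # skewed tiles overhang the iteration space
                        A[i][j] = A[i][j - 1] + A[i - 1][j + 1] + 1
    return A


if __name__ == "__main__":
    print(skewed_tiled(6, 5, 2) == reference(6, 5))
```

Running the script should print `True`: the skewed, tiled wavefront order computes the same values as the original loop nest. Tiling the unskewed (i, j) space in the same way would read `A[i-1][j+1]` from a tile that has not been computed yet.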

Feb, 20

A Code Optimization Framework for Performance Portability of GPU Kernels onto Custom Accelerators

The shift toward parallel computing has resulted in a growing interest in computing systems with heterogeneous processing modules. Reconfigurable devices are often employed in such heterogeneous systems due to their low power and parallel processing benefits. An important issue in the programmability of these systems is the need for a single programming interface. Recent works […]

CUDA

Feb, 20

Introducing ‘Bones’: A Parallelizing Source-to-Source Compiler Based on Algorithmic Skeletons

Recent advances in multi-core and many-core processors require programmers to exploit an increasing amount of parallelism from their applications. Data parallel languages such as CUDA and OpenCL make it possible to take advantage of such processors, but still require a large amount of effort from programmers. A number of parallelizing source-to-source compilers have recently been […]

CUDA • OpenCL

Feb, 20

Review: Kd-tree Traversal Algorithms for Ray Tracing

In this paper we review the traversal algorithms for kd-trees for ray tracing. Ordinary traversal algorithms such as sequential, recursive, and those with neighbour-links have different limitations, which led to several new developments within the last decade. We describe algorithms exploiting ray coherence and algorithms designed with specific hardware architecture limitations such as memory latency […]

Feb, 18

GPU Parallel Statistical and Cube Test Analysis of the SHA-3 Finalist Candidate Hash Functions

The 256-bit versions of the SHA-3 finalist candidate hash functions – BLAKE, Grostl, JH, Keccak, and Skein – were subjected to statistical tests to attempt to disprove the hypothesis that the output bits are uniformly distributed, independent, binary random variables. The hash functions were also subjected to cube tests to attempt to disprove the hypothesis […]

CUDA
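As a rough illustration of what testing "output bits are uniformly distributed" can mean, the sketch below applies the frequency (monobit) test described in NIST SP 800-22 to concatenated digests. The abstract does not say which statistical tests the paper used, so this is an assumption, not the authors' method. It also uses Python's `hashlib.sha3_256`, the standardized FIPS 202 function, which differs in padding from the round-3 Keccak finalist; the counter-based input set in `hash_stream` is hypothetical, and cube tests are not shown.

```python
import hashlib
import math


def monobit_p_value(data: bytes) -> float:
    """Frequency (monobit) test in the style of NIST SP 800-22.

    Returns a p-value; a small value (commonly < 0.01) is evidence against
    the hypothesis that the bits are uniformly distributed.
    """
    n = len(data) * 8
    ones = sum(bin(byte).count("1") for byte in data)
    s = 2 * ones - n                 # +1 per one bit, -1 per zero bit
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def hash_stream(num_inputs: int) -> bytes:
    """Concatenate SHA3-256 digests of 64-bit big-endian counters (hypothetical inputs)."""
    return b"".join(
        hashlib.sha3_256(i.to_bytes(8, "big")).digest() for i in range(num_inputs)
    )


if __name__ == "__main__":
    p = monobit_p_value(hash_stream(10_000))
    print(f"monobit p-value: {p:.4f}")
```

A single test like this can only fail to reject uniformity; a GPU version would presumably parallelize the hashing over many inputs, which is where the bulk of the work lies.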

Feb, 18

Exploiting Segmentation for Robust 3D Object Matching

While Iterative Closest Point (ICP) algorithms have been successful at aligning 3D point clouds, they do not take into account constraints arising from sensor viewpoints. More recent beam-based models take into account sensor noise and viewpoint, but problems still remain. In particular, good optimization strategies are still lacking for the beam-based model. In situations of […]

CUDA • OpenGL

Feb, 18

Performance Portability with the Chapel Language

It has been widely shown that high-throughput computing architectures such as GPUs offer large performance gains compared with their traditional low-latency counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, loss of portability across different architectures, explicit data movement, and challenges […]

CUDA

Feb, 18

Cone-beam Computed tomography image reconstruction based on GPU

Three-dimensional cone-beam computed tomography (CBCT) image reconstruction has long been an active topic in medical imaging. CBCT reconstruction is computationally expensive, and reconstruction times are long. Advances in computer technology, especially the rapid development of general-purpose computing on Graphics Processing Units (GPUs), now enable fast CBCT […]

CUDA

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)