high performance computing on graphics processing units: hgpu.org

Posts

Oct, 13

.NET High Performance Computing

Graphics Processing Units (GPUs) have been extensively applied in the High Performance Computing (HPC) community. HPC applications require additional special programming environments to improve the utilization of GPUs, for example, NVIDIA’s CUDA and the Khronos Group’s OpenCL. This thesis will introduce a preprocessor framework called HPC.NET, which is deployed on the Microsoft .NET platform to meet […]

CUDA

Oct, 13

FPGA-GPU-CPU Heterogenous Architecture for Real-time Cardiac Physiological Optical Mapping

Real-time optical mapping technology is a technique that can be used in cardiac disease study and treatment technology development to obtain accurate and comprehensive electrical activity over the entire heart. It provides a dense spatial electrophysiology. Each pixel essentially plays the role of a probe on that location of the heart. However, the high throughput […]

CUDA

Oct, 13

Parallel H-Tree Based Data Cubing on Graphics Processors

Graphics processing units (GPUs) have an SIMD architecture and have been widely used recently as powerful general-purpose co-processors for the CPU. In this paper, we investigate efficient GPU-based data cubing because the most frequent operation in data cube computation is aggregation, which is an expensive operation well suited for SIMD parallel processors. H-tree is a […]

CUDA

Oct, 13

Accelerating Cost Aggregation for Real-Time Stereo Matching

Real-time stereo matching, which is important in many applications like self-driving cars and 3-D scene reconstruction, requires large computation capability and high memory bandwidth. The most time-consuming part of stereo matching algorithms is the aggregation of information (i.e. costs) over local image regions. In this paper, we present a generic representation and suitable implementations for three […]

OpenCL
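
As an illustration of what "aggregation of costs over local image regions" means in the abstract above, here is a minimal sketch of the simplest possible variant: a box-filter sum computed with a summed-area table. It is not the paper's method (the abstract is truncated before naming its three approaches); the function name `aggregate_costs`, the `radius` parameter, and the border clipping are assumptions made for this example.

```python
def aggregate_costs(cost, radius):
    """Sum per-pixel matching costs over a (2*radius+1) x (2*radius+1) window.

    `cost` is a list of rows (H x W) holding the matching cost of every pixel
    for one disparity hypothesis. Windows are clipped at the image border.
    This is a plain box filter, used here only as an illustrative stand-in.
    """
    h, w = len(cost), len(cost[0])

    # Summed-area table with a zero border:
    # sat[y][x] holds the sum of cost[0:y][0:x].
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += cost[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum

    # Each window sum then needs only four table lookups.
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            out[y][x] = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
    return out
```

The table makes the per-pixel work independent of the window size, which is one reason this step is usually treated as bandwidth-bound rather than compute-bound.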

Oct, 13

Programming NVIDIA cards by means of transitive closure based parallelization algorithms

Massively parallel processing is a type of computing that uses many separate CPUs or GPUs running in parallel to execute a single program. Because most computations are contained in program loops, automatic extraction of parallelism available in loops is extremely important for many-core systems. In this paper, we study speed-up and scalability of parallel code […]

CUDA

Oct, 13

Extendable Pattern-Oriented Optimization Directives (extended version)

Algorithm-specific, i.e., semantic-specific optimizations have been observed to bring significant performance gains, especially for a diverse set of multi/many-core architectures. However, current programming models and compiler technologies for the state-of-the-art architectures do not exploit these performance opportunities well. In this paper, we propose a pattern-making methodology that enables algorithm-specific optimizations to be encapsulated into "optimization […]

Oct, 13

Mesh Independent Loop Fusion for Unstructured Mesh Applications

Applications based on unstructured meshes are typically compute intensive, leading to long running times. In principle, state-of-the-art hardware, such as multi-core CPUs and many-core GPUs, could be used for their acceleration but these esoteric architectures require specialised knowledge to achieve optimal performance. OP2 is a parallel programming layer which attempts to ease this programming burden […]

CUDA

Oct, 13

GPU-Based Local-Dimming for Power Efficient Imaging

This paper describes a local dimming method for reducing the power consumption of LCD monitors. Reducing this load is of ever-growing importance as the display becomes the dominant power consumer in mobile computing. As a side effect, our method not only significantly reduces the power consumption but also improves the visual quality (see […]

CUDA

Oct, 13

Automatic Parallelization of Tiled Loop Nests with Enhanced Fine-Grained Parallelism on GPUs

Automatically parallelizing loop nests into CUDA kernels must exploit the full potential of GPUs to obtain high performance. One state-of-the-art approach makes use of the polyhedral model to extract parallelism from a loop nest by applying a sequence of affine transformations to the loop nest. However, how to automate this process to exploit both intra- and […]

CUDA

Oct, 9

Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms

Image segmentation is a very important step in many GIS applications. Mean shift is an advanced and versatile technique for clustering-based segmentation, and is favored in many cases because it is non-parametric. However, mean shift is very computationally intensive compared with other simple methods such as k-means. In this work, we present a hybrid design […]

CUDA

Oct, 9

Applying Genetic Algorithms to Tune Heterogeneous Platform Configurations

The need to move towards heterogeneous architectures is well established. This has increased the importance of parallelizing software to achieve good performance. The use of mixed architectures exponentially increases the need for the programmer to understand the intricacies of the underlying hardware to achieve optimal speedup. Obtaining optimal performance on one such architecture is […]

OpenCL

Oct, 9

A PCG Implementation of an Elliptic Kernel in an Ocean Global Circulation Model Based on GPU Libraries

In this paper an inverse preconditioner for the numerical solution of an elliptic Laplace problem of a global circulation ocean model is presented. The inverse preconditioning technique is adopted in order to efficiently compute the numerical solution of the elliptic kernel by using the Conjugate Gradient (CG) method. We show how the performance and […]

CUDA
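
For readers unfamiliar with the preconditioned Conjugate Gradient iteration mentioned in the abstract above, a minimal sketch follows. It assumes a small dense symmetric positive definite matrix and uses a Jacobi (diagonal) preconditioner as a simple stand-in; the paper's own inverse preconditioner and its GPU library calls are not reproduced here. The function name `pcg` and its parameters are hypothetical.

```python
def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A (list of rows).

    Uses Conjugate Gradient with a Jacobi preconditioner M = diag(A).
    The Jacobi choice is an assumption for illustration only.
    """
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    m_inv = [1.0 / A[i][i] for i in range(n)]   # inverse of the diagonal

    x = [0.0] * n                               # initial guess
    r = list(b)                                 # residual b - A x for x = 0
    z = [m_inv[i] * r[i] for i in range(n)]     # preconditioned residual
    p = list(z)                                 # first search direction
    rz = dot(r, z)

    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:              # stop on small residual norm
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)                 # step length along p
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [m_inv[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz                      # keeps directions A-conjugate
        p = [z[i] + beta * p[i] for i in range(n)]
        rz = rz_new
    return x
```

In a GPU setting, the matrix-vector product and the dot products are the operations typically delegated to library routines; the loop structure stays the same.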


