high performance computing on graphics processing units: hgpu.org

Posts

Oct, 4

Performance Portability Evaluation for OpenACC on Intel Knights Corner and Nvidia Kepler

OpenACC is a programming standard designed to simplify heterogeneous parallel programming by using directives. Because OpenACC can generate OpenCL and CUDA code, and the CAPS HMPP compiler supports running OpenCL on Intel Knights Corner, it is attractive to use OpenACC on hardware with different underlying microarchitectures. This paper studies how realistic it is to […]

Oct, 4

Facial Expression Recognition – Review

Expression recognition (happy, sad, disgust, surprise, angry, and fear expressions) is an application of advanced object detection, pattern recognition, and classification. Facial expression recognition techniques detect people's emotions from their facial expressions. This has found applications in technical fields such as human-computer interaction (HCI) and security monitoring. It generally requires fast processing and decision making. Therefore, […]

CUDA

Oct, 4

Parallel Computing Using GPU for Efficient Traffic Simulation

Parallel Computing can be made possible using the multiple cores of the Graphics Processing Unit (GPU) thanks to the modern programmable GPU models. This allows the use of parallel computing techniques to improve upon the computation time of large scale traffic simulations. This paper proposes the use of a multi-processor algorithm for creating efficient traffic […]

CUDA

Oct, 4

Advanced Optimization Techniques for Sparse Grids on Modern Heterogeneous Systems

GPU-based heterogeneous systems provide peak performance on the order of TFlop/s and an advantageous ratio between performance and energy consumption. However, reaching high performance on GPUs is often a difficult task. This thesis proposes advanced optimization techniques that allow for efficiently porting a set of sparse grid algorithms to GPUs. The performance obtained […]

CUDA

Oct, 4

Cudagrind: A Valgrind Extension for CUDA

Valgrind, and specifically the included tool Memcheck, offers an easy and reliable way of checking the correctness of memory operations in programs. This works in a non-intrusive way: Valgrind translates the program into intermediate code and executes it on an emulated CPU. The heavyweight tool Memcheck uses this to keep a full shadow […]

CUDA

Oct, 4

3D Non-Local Means denoising via multi-GPU

The Non-Local Means (NLM) algorithm is widely considered a state-of-the-art denoising filter in many research fields. Its high computational complexity has led to implementations on Graphics Processing Unit (GPU) architectures, which achieve reasonable running times by filtering 3D datasets slice by slice with a 2D NLM approach. Here we present a fully 3D NLM implementation on a multi-GPU architecture […]

CUDA

Oct, 4

MC-RANSAC: A Pre-processing Model for RANSAC using Monte Carlo method implemented on a GPU

RANSAC is a repeated hypothesize-and-verify procedure for parameter estimation and for filtering noise or outlier data. In the traditional approach, this algorithm is evaluated without any prior information on the set of data points, which leads to an increase in the number of iterations and compute time. In this paper, we present a GPU based […]

CUDA
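
The hypothesize-and-verify loop mentioned in the abstract above can be illustrated with a minimal CPU sketch of traditional RANSAC for 2D line fitting. This is not the MC-RANSAC pre-processing model or its GPU implementation, which the truncated abstract does not describe. The function names `fit_line` and `ransac_line`, the parameters `iterations`, `tolerance`, and `seed`, and the sample data are all assumptions made for illustration.

```python
import random


def fit_line(p, q):
    """Return (slope, intercept) of the line through p and q, or None if vertical."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        return None
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def ransac_line(points, iterations=200, tolerance=0.5, seed=0):
    """Traditional RANSAC with no prior information about the data points.

    Each iteration hypothesizes a line from two randomly chosen points and
    verifies it by counting points whose vertical residual is within
    `tolerance`. The hypothesis with the most inliers is kept.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # Hypothesize: fit a candidate model to a minimal random sample.
        model = fit_line(*rng.sample(points, 2))
        if model is None:
            continue  # skip degenerate (vertical) samples
        slope, intercept = model
        # Verify: collect the points that agree with the candidate model.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# Hypothetical data: ten points on y = 2x + 1 plus three gross outliers.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
data += [(2.0, 30.0), (5.0, -20.0), (8.0, 50.0)]
model, inliers = ransac_line(data)
```

Because every sample is drawn blindly, many iterations are spent on hypotheses that include an outlier. That cost is what the abstract attributes to the traditional approach.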

Oct, 3

Towards Multi-GPU Support in the Marrow Skeleton Framework

An emerging trend in the field of Graphics Processing Unit (GPU) computing is the harnessing of multiple devices to tackle bigger problems and increase performance. Multi-GPU execution adds new challenges to the already complex world of General Purpose computing on GPUs (GPGPU), such as efficient GPU-aware problem decomposition and coping with heterogeneity. To this […]

OpenCL

Oct, 3

GPU-accelerated triangle-triangle intersection tester algorithm

The goal of the project is to develop a triangle-triangle collision algorithm. A reference triangle is given, as well as a variably sized array of many other triangles. The algorithm must check whether a triangle intersects the reference triangle. That operation has to be performed for each "non-reference" triangle against the reference triangle. If one […]

CUDA

Oct, 3

Compiler Optimizations for SIMD/GPU/Multicore Architectures

In modern computer architectures, both SIMD (single-instruction multiple-data) instruction set extensions and GPUs can be used to accelerate general-purpose applications. In addition, multicore machines can potentially provide more computation power for high performance computing with an increasing number of cores and deeper cache hierarchies. However, writing high-performance code manually for these architectures is […]

CUDA

Oct, 2

CUDA Enhanced Filtering in a Pipelined Video Processing Framework

The processing of digital video has long been a significant computational task for modern x86 processors. With every video frame composed of one to three planes, each consisting of a two-dimensional array of pixel data, and a video clip comprising thousands of such frames, the sheer volume of data is significant. With the introduction […]

CUDA

Oct, 2

Parallel Hyperspectral Unmixing on GPUs

This letter presents a new parallel method for hyperspectral unmixing built on the efficient combination of two popular methods: vertex component analysis (VCA) and sparse unmixing by variable splitting and augmented Lagrangian (SUNSAL). First, VCA extracts the end-member signatures, and then SUNSAL is used to estimate the abundance fractions. Both techniques are highly parallelizable, which […]

OpenCL

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?
