high performance computing on graphics processing units: hgpu.org

Posts

Oct, 5

Clock Math – A System for Solving SLEs Exactly

In this paper, we present a GPU-accelerated hybrid system that solves ill-conditioned systems of linear equations (SLEs) exactly. Here, "exactly" means without rounding errors, which is achieved by using integer arithmetic. First, we scale floating-point numbers up to integers; then we solve dozens of SLEs in different modular arithmetics; finally, we assemble the sub-solutions using the Chinese remainder […]

OpenCL
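The modular approach summarized in the Clock Math abstract can be illustrated with a minimal CPU-side sketch. It is not the paper's implementation: the function names, the choice of primes, and the restriction to 2x2 integer systems solved by Cramer's rule are all assumptions made for illustration, and the floating-point scaling step is omitted.

```python
from fractions import Fraction
from math import prod


def crt(residues, moduli):
    """Combine residues r_i (mod m_i), with pairwise coprime m_i, into x mod prod(m_i)."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the modular inverse of Mi modulo m (Python 3.8+).
        x += r * Mi * pow(Mi, -1, m)
    return x % M, M


def symmetric(x, M):
    """Map x in [0, M) to its signed representative, so negative values are recovered."""
    return x - M if x > M // 2 else x


def solve_2x2_exact(A, b, primes=(10007, 10009, 10037)):
    """Solve an integer 2x2 system exactly (hypothetical helper, illustration only).

    The determinant and the Cramer numerators are computed modulo each prime,
    then reassembled with the Chinese remainder theorem. The product of the
    primes must exceed twice the magnitude of every reconstructed integer.
    """
    det_res, n1_res, n2_res = [], [], []
    for p in primes:
        a11, a12, a21, a22 = (v % p for v in (A[0][0], A[0][1], A[1][0], A[1][1]))
        b1, b2 = b[0] % p, b[1] % p
        det_res.append((a11 * a22 - a12 * a21) % p)
        n1_res.append((b1 * a22 - a12 * b2) % p)
        n2_res.append((a11 * b2 - b1 * a21) % p)
    det = symmetric(*crt(det_res, primes))
    if det == 0:
        raise ValueError("singular system")
    n1 = symmetric(*crt(n1_res, primes))
    n2 = symmetric(*crt(n2_res, primes))
    return Fraction(n1, det), Fraction(n2, det)


x, y = solve_2x2_exact([[2, 1], [1, 3]], [3, 5])
print(x, y)  # 4/5 7/5
```

Each per-prime computation is independent of the others, which is presumably what makes the approach attractive for parallel hardware; the sketch simply runs them in a loop.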

Oct, 5

GPU Based Generation and Real-Time Rendering of Semi-Procedural Terrain Using Features

Generation and real-time rendering of terrain is a complex and multifaceted problem. Besides the obvious trade-offs between performance and quality, many different generation and rendering solutions exist. Different choices in implementation will result in very different visuals, usability and tools for generation. In this thesis, a fast and intuitive terrain generation method based on sketching […]

OpenCL

Oct, 4

Performance Portability Evaluation for OpenACC on Intel Knights Corner and Nvidia Kepler

OpenACC is a programming standard designed to simplify heterogeneous parallel programming by using directives. Since OpenACC can generate OpenCL and CUDA code, and the CAPS HMPP compiler supports running OpenCL on Intel Knights Corner, it is attractive to use OpenACC on hardware with different underlying microarchitectures. This paper studies how realistic it is to […]

Oct, 4

Facial Expression Recognition – Review

Expression recognition (happy, sad, disgust, surprise, angry, and fear expressions) is an application of advanced object detection, pattern recognition, and classification. Facial expression recognition techniques detect people's emotions from their facial expressions. This has found applications in technical fields such as human-computer interaction (HCI) and security monitoring. It generally requires fast processing and decision making. Therefore, […]

CUDA

Oct, 4

Parallel Computing Using GPU for Efficient Traffic Simulation

Parallel computing is possible on the multiple cores of the Graphics Processing Unit (GPU) thanks to modern programmable GPU models. This allows the use of parallel computing techniques to reduce the computation time of large-scale traffic simulations. This paper proposes the use of a multi-processor algorithm for creating efficient traffic […]

CUDA

Oct, 4

Advanced Optimization Techniques for Sparse Grids on Modern Heterogeneous Systems

GPU-based heterogeneous systems provide a peak performance on the order of TFlop/s and an advantageous ratio between performance and energy consumption. However, reaching high performance on GPUs is often a difficult task. This thesis proposes advanced optimization techniques that allow for efficiently porting a set of sparse grid algorithms to GPUs. The performance obtained […]

CUDA

Oct, 4

Cudagrind: A Valgrind Extension for CUDA

Valgrind, and specifically the included tool Memcheck, offers an easy and reliable way of checking the correctness of memory operations in programs. This works in a non-intrusive way: Valgrind translates the program into intermediate code and executes it on an emulated CPU. The heavyweight tool Memcheck uses this to keep a full shadow […]

CUDA

Oct, 4

3D Non-Local Means denoising via multi-GPU

The Non-Local Means (NLM) algorithm is widely considered a state-of-the-art denoising filter in many research fields. Its high computational complexity has led to implementations on Graphics Processing Unit (GPU) architectures, which achieve reasonable running times by filtering 3D datasets slice by slice with a 2D NLM approach. Here we present a fully 3D NLM implementation on a multi-GPU architecture […]

CUDA

Oct, 4

MC-RANSAC: A Pre-processing Model for RANSAC using Monte Carlo method implemented on a GPU

RANSAC is a repeated hypothesize-and-verify procedure for parameter estimation and for filtering noise or outlier data. In the traditional approach, this algorithm is evaluated without any prior information on the set of data points, which leads to an increase in the number of iterations and compute time. In this paper, we present a GPU-based […]

CUDA
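The hypothesize-and-verify loop of traditional RANSAC, as described in the MC-RANSAC abstract, can be sketched as follows. This is a plain CPU baseline for 2D line fitting, not the paper's Monte Carlo pre-processing model or its GPU implementation; the function name, parameters, and default values are assumptions chosen for illustration.

```python
import random


def ransac_line(points, iters=200, tol=0.5, seed=0):
    """Fit y = slope * x + intercept to 2D points with a basic RANSAC loop.

    Hypothetical helper: each iteration hypothesizes a line from two random
    points, then verifies it by counting points within `tol` (vertical distance).
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    best_model, best_inliers = None, []
    for _ in range(iters):
        # Hypothesize: a minimal sample of two points defines a candidate line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line, not representable in slope-intercept form
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # Verify: collect the points that agree with the candidate line.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers


# Twenty points on y = 2x + 1 plus three gross outliers.
data = [(x, 2 * x + 1) for x in range(20)] + [(3, 40), (7, -30), (12, 100)]
model, inliers = ransac_line(data)
print(model, len(inliers))
```

Because samples are drawn uniformly with no prior information about the data, many iterations are spent on hypotheses that include an outlier, which is the cost the abstract refers to.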

Oct, 3

Towards Multi-GPU Support in the Marrow Skeleton Framework

An emerging trend in the field of Graphics Processing Unit (GPU) computing is the harnessing of multiple devices to tackle bigger problems and increase performance. Multi-GPU execution adds new challenges to the already complex world of General-Purpose computing on GPUs (GPGPU), such as efficient GPU-aware problem decomposition and coping with heterogeneity. To this […]

OpenCL

Oct, 3

GPU-accelerated triangle-triangle intersection tester algorithm

The goal of the project is to develop a triangle-triangle collision algorithm. A reference triangle is given, as well as a variably-sized array of many other triangles. The algorithm must check whether a triangle intersects the reference triangle. This test has to be performed for each "non-reference" triangle against the reference triangle. If one […]

CUDA

Oct, 3

Compiler Optimizations for SIMD/GPU/Multicore Architectures

In modern computer architectures, both SIMD (single-instruction multiple-data) instruction set extensions and GPUs can be used to accelerate general-purpose applications. In addition, multicore machines can potentially provide more computation power for high performance computing with an increasing number of cores and deeper cache hierarchies. However, writing high-performance code manually for these architectures is […]

CUDA

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
