high performance computing on graphics processing units: hgpu.org

Posts

Nov, 24

Improving the Performance of the Linear Systems Solvers Using CUDA

Parallel computing can offer an enormous advantage regarding the performance for very large applications in almost any field: scientific computing, computer vision, databases, data mining, and economics. GPUs are high performance many-core processors that can obtain very high FLOP rates. Since the first idea of using GPU for general purpose computing, things have evolved and […]

CUDA

Nov, 24

Enhancing and Porting the HPC-Lab Snow Simulator to OpenCL on Mobile Platforms

Porting a computationally demanding CUDA application to a GPU designed for mobile phones and tablets, which supports OpenCL, is the subject of this thesis. Significant effort is made to prepare the snow simulator of the HPC-LAB at IDI, NTNU, for porting to an OpenCL capable GPU for mobile phones, with a reasonably limited effort, when […]

OpenCL

Nov, 24

GPU Isosurface Raycasting of FCC Datasets

This paper presents an efficient and accurate isosurface rendering algorithm for the natural C^1 splines on the face-centered cubic (FCC) lattice. Leveraging fast and accurate evaluation of a spline field and its gradient, accompanied by efficient empty-space skipping, the approach generates high-quality isosurfaces of FCC datasets at interactive speed (20-70 fps). The pre-processing computation (quasi-interpolation […]

OpenCL • OpenGL

Nov, 23

Auto-tuning on the macro scale: high level algorithmic auto-tuning for scientific applications

In this thesis, we describe a new classification of auto-tuning methodologies spanning from low-level optimizations to high-level algorithmic tuning. This classification spectrum of auto-tuning methods encompasses the space of tuning parameters from low-level optimizations (such as block sizes, iteration ordering, vectorization, etc.) to high-level algorithmic choices (such as whether to use an iterative solver or […]

Nov, 23

Evaluation of Two Parallel Finite Element Implementations of the Time-Dependent Advection Diffusion Problem: GPU versus Cluster Considering Time and Energy Consumption

We analyze two parallel finite element implementations of the 2D time-dependent advection diffusion problem, one for multi-core clusters and one for CUDA-enabled GPUs, and compare their performances in terms of time and energy consumption. The parallel CUDA-enabled GPU implementation was derived from the multi-core cluster version. Our experimental results show that a desktop machine with […]

CUDA

Nov, 23

GPU Acceleration of Transmural Electrophysiological Imaging

Transmural electrophysiological imaging (TEPI) is becoming a possibility with the aid of 3D in silico cardiac EP models and the statistical estimation theory. By quasi Monte-Carlo (MC) simulation of the 3D EP models on the subject-specific anatomical model, complex and physiologically meaningful spatiotemporal priors are produced to achieve the 2D-to-3D transition of EP data, an […]

CUDA

Nov, 23

Scalable Multi-GPU 3-D FFT for TSUBAME 2.0 Supercomputer

For scalable 3-D FFT computation using multiple GPUs, efficient all-to-all communication between GPUs is the most important factor in good performance. Implementations with point-to-point MPI library functions and CUDA memory copy APIs typically exhibit very large overheads especially for small message sizes in all-to-all communications between many nodes. We propose several schemes to minimize the […]

CUDA

Nov, 23

Efficient reconstruction of biological networks via transitive reduction on general purpose graphics processors

BACKGROUND: Techniques for reconstruction of biological networks which are based on perturbation experiments often predict direct interactions between nodes that do not exist. Transitive reduction removes such relations if they can be explained by an indirect path of influences. The existing algorithms for transitive reduction are sequential and might suffer from too long run times for large […]

CUDA

Nov, 22

GPU Implementation of Fuzzy Anisotropic Diffusion

In this paper, we present a GPU-based implementation of the Fuzzy-Anisotropic diffusion technique oriented for high-resolution multidimensional image/video techniques. The aggregation of parallel computing and the HW/SW co-design techniques are used in order to improve the time performance of the Fuzzy-Anisotropic Diffusion algorithm for image/video applications. Experimental results show the significantly increased performance efficiency both […]

CUDA

Nov, 22

CoreTSAR: Task Scheduling for Accelerator-aware Runtimes

Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration […]

CUDA

Nov, 22

Automatic generation of software pipelines for heterogeneous parallel systems

Pipelining is a well-known approach to increasing parallelism and performance. We address the problem of software pipelining for heterogeneous parallel platforms that consist of different multi-core and many-core processing units. In this context, pipelining involves two key steps—partitioning an application into stages and mapping and scheduling the stages onto the processing units of the heterogeneous […]

CUDA

Nov, 22

Tera-scale Astronomical Data Analysis and Visualization

We present a high-performance, graphics processing unit (GPU)-based framework for the efficient analysis and visualization of (nearly) terabyte (TB)-sized 3-dimensional images. Using a cluster of 96 GPUs, we demonstrate for a 0.5 TB image: (1) volume rendering using an arbitrary transfer function at 7–10 frames per second; (2) computation of basic global image statistics such […]

CUDA

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?
