high performance computing on graphics processing units: hgpu.org

Posts

Apr, 8

CANSCID-CUDA

The 2010 MEMOCODE Hardware Software Co-design challenge is to implement a Deep Packet Inspection architecture called CANSCID (Combined Architecture for Stream Categorization and Intrusion Detection). In this short paper, we present the design details of our submission, which uses a Graphics Processing Unit (GPU) to accelerate parallel regular expression matching. The target […]

CUDA

Apr, 8

Shape Manipulation on GPU

This paper proposes a novel hardware-accelerated deformation algorithm based on a curve-skeleton model for 2D shape manipulation. The deformation algorithm can achieve real-time interactive shape manipulation without any pre-computing step. The deforming regions of shapes are demarcated with a simple skeleton frame and are simulated by a curve-skeleton model consisting of triangle strips. The algorithm obtains two […]

OpenGL

Apr, 7

High throughput multiple-precision GCD on the CUDA architecture

Investigation of the cryptanalytic strength of RSA cryptography requires computing many GCDs of two long integers (e.g., of length 1024 bits). This paper presents a high throughput parallel algorithm to perform many GCD computations concurrently on a GPU based on the CUDA architecture. The experiments with an NVIDIA GeForce GTX285 GPU and a single core […]

CUDA

Apr, 7

Program Optimization of Stencil Based Application on the GPU-Accelerated System

Graphics Processing Units (GPUs), with many lightweight data-parallel cores, can provide substantial parallel computational power to accelerate general-purpose applications. But this computing capacity cannot be fully utilized by memory-intensive applications, which are limited by off-chip memory bandwidth and latency. Stencil computation has abundant parallelism and low computational intensity, which make it a […]

Apr, 7

Scalability of Higher-Order Discontinuous Galerkin FEM Computations for Solving Electromagnetic Wave Propagation Problems on GPU Clusters

A highly parallel implementation of Maxwell’s equations in the time domain using a cluster of Graphics Processing Units (GPUs) is presented. The higher-order Discontinuous Galerkin Finite Element Method (DG-FEM) is used for spatial discretization since its characteristics match the parallelization design aspects of the NVIDIA Compute Unified Device Architecture (CUDA) programming model. Asynchronous data […]

CUDA

Apr, 7

Using GPU to Accelerate Cache Simulation

Caches play a major role in the performance of high-speed computer systems. Trace-driven simulation is the most widely used method to evaluate cache architectures. However, as cache designs move to more complicated architectures, the traces are also becoming larger and larger. Traditional simulation methods are no longer practical […]

CUDA

Apr, 7

Accelerating Phase Correlation Functions Using GPU and FPGA

In this paper, we present a comparative study of implementations of the phase correlation function on GPUs, ASICs and FPGAs. The Phase-Only Correlation (POC) method demonstrates high robustness and subpixel accuracy in pattern matching and image registration. However, it is computationally slow because it requires calculations such as the 2D FFT. We have […]

CUDA

Apr, 7

A Neighborhood Grid Data Structure for Massive 3D Crowd Simulation on GPU

Simulation and visualization of emergent crowds in real time is a computationally intensive task. This intensity mostly comes from the O(n²) complexity of the traversal algorithm, necessary for the proximity queries of all pairs of entities in order to compute the relevant mutual interactions. Previous works reduced this complexity by considerable factors, using adequate data structures […]

CUDA

Apr, 7

Context-aware volume navigation

The trackball metaphor is exploited in many applications where volumetric data needs to be explored. Although it provides an intuitive way to inspect the overall structure of objects of interest, a detailed inspection can be tedious or, when cavities occur, even impossible. Therefore we propose a context-aware navigation technique for the exploration of volumetric […]

OpenCL

Apr, 7

Practical examples of GPU computing optimization principles

In this paper, we provide examples to optimize signal processing or visual computing algorithms written for SIMT-based GPU architectures. These implementations demonstrate the optimizations for CUDA or its successors OpenCL and DirectCompute. We discuss the effect and optimization principles of memory coalescing, bandwidth reduction, processor occupancy, bank conflict reduction, local memory elimination and instruction optimization. […]

CUDA, OpenCL

Apr, 7

ARKCoS: Artifact-Suppressed Accelerated Radial Kernel Convolution on the Sphere

We describe a hybrid Fourier/direct space convolution algorithm for compact radial (azimuthally symmetric) kernels on the sphere. For high resolution maps covering a large fraction of the sky, our implementation takes advantage of the inexpensive massive parallelism afforded by consumer graphics processing units (GPUs). Applications involve modeling of instrumental beam shapes in terms of compact […]

CUDA

Apr, 7

Scaling Hierarchical N-body Simulations on GPU Clusters

This paper focuses on the use of GPGPU-based clusters for hierarchical N-body simulations. Whereas the behavior of these hierarchical methods has been studied in the past on CPU-based architectures, we investigate key performance issues in the context of clusters of GPUs. These include kernel organization and efficiency, the balance between tree traversal and force computation […]

CUDA

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?
