high performance computing on graphics processing units: hgpu.org

Posts

Apr, 13

Accelerate Cache Simulation with Generic GPU

Trace-driven cache simulation is the most widely used method to evaluate different cache structures. Several techniques have been proposed to reduce the simulation time of sequential trace-driven simulation. An obvious way to achieve fast parallel simulation is to simulate the individual independent sets of a cache concurrently on different compute resources. We propose improvements to […]

CUDA
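The set-level parallelism mentioned in the cache simulation abstract above can be sketched in a few lines. This is a minimal illustration and not the authors' method: the cache geometry (`LINE_SIZE`, `NUM_SETS`, `WAYS`), the LRU replacement policy, and the function names are all assumptions made for the example, and a CPU thread pool stands in for GPU compute resources.

```python
from collections import OrderedDict, defaultdict
from concurrent.futures import ThreadPoolExecutor

# Assumed cache geometry, chosen only for illustration.
LINE_SIZE = 64   # bytes per cache line
NUM_SETS = 4     # number of sets
WAYS = 2         # associativity


def simulate_set(tags):
    """Simulate one cache set with LRU replacement; return its miss count."""
    lru = OrderedDict()
    misses = 0
    for tag in tags:
        if tag in lru:
            lru.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) >= WAYS:
                lru.popitem(last=False)  # evict the least recently used tag
            lru[tag] = True
    return misses


def simulate_trace(trace):
    """Split an address trace by set index, then simulate each set independently."""
    per_set = defaultdict(list)
    for addr in trace:
        block = addr // LINE_SIZE
        per_set[block % NUM_SETS].append(block // NUM_SETS)
    # Sets do not interact, so each per-set trace can be processed concurrently.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(simulate_set, per_set.values()))
```

Because accesses that map to different sets never affect each other, the per-set traces keep their internal order and the total miss count matches a sequential simulation of the same cache.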

Apr, 13

NQueens on CUDA: Optimization Issues

Today's commercial off-the-shelf computer systems are multicore computing systems that combine a CPU, a graphics processor (GPU), and custom devices. Compared with CPU cores, graphics cards can run hundreds to thousands of compute units in parallel. To benefit from these GPU computing resources, applications have to be parallelized and adapted to the […]

CUDA

Apr, 13

Memory Saving Discrete Fourier Transform on GPUs

This paper presents an alternative method for computing the two-dimensional Discrete Fourier Transform. While current GPU Fourier transform libraries need a large buffer for storing intermediate results, our method can compute the same output with far less memory. It does so by exploiting the separability of the Fourier transform. Using this scheme, it is […]

OpenCL
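The separability that the Fourier transform abstract above relies on can be checked with a short sketch. It only illustrates the property itself (a 2D DFT equals 1D DFTs over the rows followed by 1D DFTs over the columns); it does not reproduce the paper's memory-saving scheme, and the function names are hypothetical.

```python
import cmath


def dft_1d(x):
    """Naive O(n^2) one-dimensional DFT of a sequence."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def dft_2d_separable(a):
    """2D DFT computed as 1D DFTs over rows, then 1D DFTs over columns."""
    rows = [dft_1d(r) for r in a]                  # transform every row
    cols = [dft_1d(list(c)) for c in zip(*rows)]   # transform every column
    return [list(r) for r in zip(*cols)]           # transpose back to row-major


def dft_2d_direct(a):
    """2D DFT evaluated directly from the double-sum definition."""
    m, n = len(a), len(a[0])
    return [
        [
            sum(
                a[r][c] * cmath.exp(-2j * cmath.pi * (u * r / m + v * c / n))
                for r in range(m)
                for c in range(n)
            )
            for v in range(n)
        ]
        for u in range(m)
    ]
```

Calling both functions on the same small matrix, for example `[[1, 2, 3], [4, 5, 6]]`, should give results that agree up to floating-point rounding.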

Apr, 13

A Multi-GPU Spectrometer System for Real-Time Wide Bandwidth Radio Signal Analysis

This paper describes the implementation of a large bandwidth multi-GPU signal processing system for radio astronomy observation. This system performs very large Fast Fourier Transform (FFT) and spectrum analysis to achieve real-time analysis of a large bandwidth spectrum. This is accomplished by implementing a four-step FFT algorithm in Compute Unified Device Architecture (CUDA). The key […]

CUDA

Apr, 13

GPU-Accelerated KLT Tracking with Monte-Carlo-Based Feature Reselection

Many computer vision methods rely on frame registration information obtained with algorithms such as the Kanade-Lucas-Tomasi (KLT) feature tracker, which is known for its excellent performance in that area. Various research groups have proposed methods to improve its performance, in terms of both execution time and stability. Recent research has shown that current graphics processing units […]

Apr, 13

Efficient Collision Detection and Physics-Based Deformation for Haptic Simulation with Local Spherical Hash

While real-time computer graphics relies on a frame rate of 30 iterations per second to fool the eye and render smooth motion transitions, computer haptics deals with the sense of touch, which requires a higher rate of around 1 kHz to avoid discontinuities. The use of haptics in interactive applications such as surgical simulations or games, […]

CUDA

Apr, 13

Community Structure Discovery algorithm on GPU with CUDA

Automatic search and community discovery in large, complex networks has important practical applications. It is difficult to balance computing speed against clustering exactness, and improving clustering exactness requires reducing the time complexity. This paper proposes a novel method, based on the Newman algorithm, for single-instruction, multiple-data (SIMD) architecture processors. The […]

CUDA

Apr, 13

Efficient parallelized particle filter design on CUDA

Particle filtering is widely used in numerous nonlinear applications which require reconfigurability, fast prototyping, and online parallel signal processing. The emerging computing platform CUDA may be regarded as the most appealing platform for such implementations. However, no literature has yet explored how to use CUDA for particle filters. This paper aims to provide two […]

CUDA

Apr, 12

Implementation and optimization of image processing algorithms on handheld GPU

The advent of GPUs with programmable shaders on handheld devices has motivated embedded application developers to use the GPU to offload computationally intensive tasks and relieve the burden on the embedded CPU. In this work, we propose an image processing toolkit for handheld GPUs with programmable shaders using the OpenGL ES 2.0 API. By using the image processing […]

OpenGL

Apr, 12

Power analysis and optimizations for GPU architecture using a power simulator

As one of the most popular many-core architectures, GPUs have demonstrated their power in many non-graphics applications. Traditional general-purpose computing systems tend to integrate a GPU as a co-processor to accelerate parallel computing tasks. Meanwhile, GPUs also consume a great deal of power, accounting for a large proportion of total system power consumption. In this […]

Apr, 12

CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator

Modern GPUs open a completely new field to optimize embarrassingly parallel algorithms. Implementing an algorithm on a GPU confronts the programmer with a new set of challenges for program optimization. Some of the most notable ones are isolating the part of the algorithm that can be optimized to run on the GPU; tuning the program […]

CUDA

Apr, 12

Automated development of applications for graphical processing units using rewriting rules

Recently there has been active development of parallel programming methods for implementing general-purpose algorithms on graphical processing units (GPUs). Using this specialized hardware can increase performance significantly, but it requires low-level programming and an understanding of the details of the underlying hardware and software platform. There is therefore a need to automate the development process. This paper presents a technique […]

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)