high performance computing on graphics processing units: hgpu.org

Posts

Sep, 16

Run-time Image and Video Resizing Using CUDA-enabled GPUs

A recently proposed approach, called seam carving, has been widely used for content-aware resizing of images and videos with little to no perceptible distortion. Unfortunately, for high-resolution videos and large images it is not computationally feasible to do the resizing in real-time using small-scale CPU systems. In this paper, we exploit highly parallel computational capabilities […]

CUDA

Sep, 16

On the Performance and Energy-efficiency of Multi-core SIMD CPUs and CUDA-enabled GPUs

This paper explores the performance and energy efficiency of CUDA-enabled GPUs and multi-core SIMD CPUs using a set of kernels and full applications. Our implementations efficiently exploit both SIMD and thread-level parallelism on multi-core CPUs and the computational capabilities of CUDA-enabled GPUs. We discuss general optimization techniques for our CPU-only and CPU-GPU platforms. To fairly […]

CUDA

Sep, 16

Hybrid Monte Carlo CT Simulation on GPU

Developing image reconstruction algorithms for diagnostic medical devices requires physically accurate and effective simulation tools. In this paper we present a hybrid Monte Carlo (MC) particle simulation method for Computed Tomography (CT) scanners. To meet the performance requirements, we combine several variance reduction techniques and tailor the algorithms for effective GPU execution. Variance reduction methods […]

CUDA

Sep, 16

Faster Multiple Pattern Matching System on GPU based on Bit-Parallelism

In this paper, we propose a fast GPU-based string matching system for large-scale string matching. The key to our proposed system is a bit-parallel pattern matching approach, which gives a compact NFA representation and fast simulation of NFA transitions on the GPU. In the experiments, we show the usefulness of our proposed pattern matching system.
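As a rough illustration of what bit-parallel NFA simulation means, the sketch below implements the classic single-pattern Shift-And algorithm on the CPU in Python. This is an assumption about the general family of techniques the abstract refers to, not the paper's multi-pattern GPU system; the function name `shift_and_search` is hypothetical. Each bit of `state` stands for one NFA state, so one shift and one AND advance every state at once.

```python
from typing import Dict, List


def shift_and_search(pattern: str, text: str) -> List[int]:
    """Return the start index of every occurrence of pattern in text.

    Classic Shift-And: bit i of `state` is set when pattern[:i + 1]
    matches the text ending at the current position.
    """
    m = len(pattern)
    if m == 0:
        return []

    # masks[c] has bit i set when pattern[i] == c.
    masks: Dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    accept = 1 << (m - 1)  # bit of the final NFA state
    state = 0
    hits: List[int] = []
    for pos, ch in enumerate(text):
        # Shift advances every active state; OR 1 restarts a match at
        # this position; AND keeps only states consistent with ch.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits
```

For example, `shift_and_search("aba", "ababa")` returns `[0, 2]`, which includes the overlapping match. The per-character update involves no branching on the pattern, which is one reason this style of matching is often considered a good fit for wide, data-parallel hardware.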

CUDA

Sep, 15

Performance analysis of SSE instructions in multi-core CPUs and GPU computing on FDTD scheme for solid and fluid vibration problems

In this work a unified treatment of solid and fluid vibration problems is developed by means of the Finite-Difference Time-Domain (FDTD) method. The proposed scheme introduces a scaling factor in the velocity fields that improves the performance of the method and the vibration analysis in heterogeneous media. In order to accurately reproduce the interaction of […]

CUDA

Sep, 15

Algorithmic GPGPU Memory Optimization

The performance of General-Purpose computation on Graphics Processing Units (GPGPU) is heavily dependent on the memory access behavior. This sensitivity is due to a combination of the underlying Massively Parallel Processing (MPP) execution model present on GPUs and the lack of architectural support to handle irregular memory access patterns. Application performance can be significantly improved […]

OpenCL

Sep, 15

Expressed Sequence Tag Clustering using Commercial Gaming Hardware

In this dissertation we aimed to use GPU technology to optimize and improve EST clustering. Extensive research on this cross-disciplinary approach was required before even considering it. It was found that though this line of research has not received significant attention, there are significant gains […]

CUDA

Sep, 15

Porting to the Intel Xeon Phi: Opportunities and Challenges

This work describes the challenges presented by porting code to the Intel Xeon Phi coprocessor, as well as opportunities for optimization and tuning. We use micro-benchmarks, code segments, assembly listings and application level results to illustrate the key issues in porting to the Xeon Phi coprocessor, always keeping in mind both portability and performance. While […]

CUDA

Sep, 15

Quine-McCluskey algorithm on GPGPU

This paper deals with parallelization of the Quine-McCluskey algorithm. This boolean function minimization algorithm has a limitation when dealing with more than four variables. The problem computed by this algorithm is NP-hard and runtime of the algorithm grows exponentially with the number of variables. The goal is to show that parallel implementation of the Quine-McCluskey […]

CUDA

Sep, 14

A Novel CPU/GPU Simulation Environment for Large-Scale Biologically-Realistic Neural Modeling

Computational Neuroscience is an emerging field that provides unique opportunities to study complex brain structures through realistic neural simulations. However, as biological details are added to models, the execution time for the simulation becomes longer. Graphics Processing Units (GPUs) are now being utilized to accelerate simulations due to their ability to perform computations in parallel. […]

CUDA

Sep, 14

GPU-based Parallel Reservoir Simulators

We have developed a GPU-based parallel linear solver package. When solving linear systems from reservoir simulation, the parallel solvers are much more efficient than CPU-based linear solvers. However, further work is needed to improve the setup phase of domain decomposition, the ILUT factorization, and the parallelism of the block ILUT preconditioner.

CUDA

Sep, 14

A GPU-based Affine and Scale Invariant Feature Transform Algorithm

Affine invariance is one of the main properties of a good feature extraction algorithm. SIFT is a scale-invariant feature extraction algorithm, but it is not affine invariant. To improve SIFT's affine invariance, the Affine and Scale Invariant Feature Transform (ASIFT) algorithm incorporates an affine model into SIFT. However, the serial ASIFT algorithm’s computing […]

CUDA

Recent source codes

DITRON: Distributed Compiler based on Triton for Parallel Systems

IntelliKit: Agent-first tooling for AMD hardware

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
