high performance computing on graphics processing units: hgpu.org

Posts

Nov, 9

Parallel Implementation of Niblack’s Binarization Approach on CUDA

Image processing and pattern recognition algorithms take considerable time to execute on a single-core processor. Graphics Processing Units (GPUs) are increasingly popular nowadays due to their speed, programmability, low cost, and large number of built-in execution cores. Most researchers have started to use GPUs as a processing unit with a single core […]

CUDA

Nov, 8

GPU-based Signal Processing Scheme for Bioinspired Optical Flow

The aim of this work is a neuromorphic, low-power GPU implementation of the processing stages for robust multichannel optical flow estimation that permits highly parallel real-time filtering.

CUDA

Nov, 8

PATUS: A Code Generation and Auto-Tuning Framework For Parallel Stencil Computations

PATUS is a code generation and auto-tuning framework for stencil computations targeted at modern multi- and many-core processors, such as multicore CPUs and graphics processing units. Its ultimate goals are to provide a means towards productivity and performance on current and future multi- and many-core platforms. The framework generates the code for a compute kernel […]

CUDA

Nov, 8

GPU Cluster with MATLAB

This paper presents the architecture of a heterogeneous cluster where each node has one or more Graphics Processing Units (GPUs). The motivation of the work is the fact that this technology presents very impressive results in High Performance Computing at a very low cost and with very small energy consumption. Although this might not be […]

CUDA

Nov, 8

Acceleration of Hessenberg Reduction for Nonsymmetric Eigenvalue Problems in a Hybrid CPU-GPU Computing Environment

The solution of large-scale dense nonsymmetric eigenvalue problems is required in many areas of scientific and engineering computing, such as vibration analysis of automobiles and analysis of electronic diffraction patterns. In this study, we focus on the Hessenberg reduction step and consider accelerating it in a hybrid CPU-GPU computing environment. Considering that the Hessenberg reduction algorithm […]

CUDA

Nov, 8

Graphics Processing Unit Utilization in Circuit Simulation

Today's graphics processing units (GPUs) include hundreds of multi-threaded, multicore processors and a complex, high-bandwidth memory architecture, making them a good alternative for speeding up general-purpose parallel computation in which large quantities of data are processed with the same functions. Some successful applications of GPU computation have also been introduced in the field of circuit simulation. The […]

CUDA

Nov, 8

20th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, PDP 2012

The Special Session on GPU Computing and Hybrid Computing aims at providing a forum for scientific researchers and engineers on hot topics related to GPU computing and hybrid computing with special emphasis on applications, performance analysis, programming models and mechanisms for mapping codes. Topics: GPU computing, multi GPU processing, hybrid computing; Programming models, programming frameworks, […]

Nov, 8

Innovative Parallel Computing: Foundations & Applications of GPU, Manycore, and Heterogeneous Systems, InPar 2012

InPar 2012 is co-located with NVIDIA’s GPU Technology Conference. This new conference provides a first-tier academic venue for peer-reviewed publications in the emerging fields of parallel computing, encompassing the topics of GPU computing, manycore computing, and heterogeneous computing. InPar has a dual focus on “Foundations” — the fundamental advances in parallel computing itself and “Applications” — […]

Nov, 8

Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark

We present the performance analysis of a port of the LU benchmark from the NAS Parallel Benchmark (NPB) suite to NVIDIA’s Compute Unified Device Architecture (CUDA), and report on the optimisation efforts employed to take advantage of this platform. Execution times are reported for several different GPUs, ranging from low-end consumer-grade products to high-end HPC-grade […]

CUDA

Nov, 8

A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures

Three out of the top four supercomputers in the November 2010 TOP500 list of the world’s most powerful supercomputers use NVIDIA GPUs to accelerate computations. Ninety-five systems from the list are using processors with six or more cores. Three-hundred-sixty-five systems use quad-core processor-based systems. Thirty-seven systems are using dual-core processors. The large-scale enabling of hybrid […]

Nov, 8

High performance massively parallel direct N-body simulations on large GPU clusters

We present direct astrophysical N-body simulations with up to six million bodies using our parallel MPI/CUDA code on large GPU clusters in China, with different kinds of GPU hardware. These clusters are directly linked under the Chinese Academy of Sciences special GPU cluster program. We reach about one third of the peak GPU performance for […]

CUDA

Nov, 8

The Infrared behavior of SU(3) Nf=12 gauge theory -about the existence of conformal fixed point-

Combined with twisted boundary conditions, Polyakov loop correlators can give a definition of the renormalized coupling. We employ this scheme for the step scaling method (with step size s = 2) in the search for a conformal fixed point of SU(3) gauge theory with 12 massless flavors. Staggered fermions and the plaquette gauge action are used in […]

CUDA

Recent source codes

DITRON: Distributed Compiler based on Triton for Parallel Systems

IntelliKit: Agent-first tooling for AMD hardware

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Most viewed papers (last 30 days)