high performance computing on graphics processing units: hgpu.org

Posts

Feb, 21

Computing least squares condition numbers on hybrid multicore/GPU systems

This paper presents an efficient computation of least squares conditioning or estimates of it. We report performance results using new routines on top of the multicore-GPU library MAGMA. This set of routines is based on an efficient computation of the variance-covariance matrix for which, to our knowledge, there is no implementation in current public domain […]

CUDA

Feb, 21

MICA: A fast short-read aligner that takes full advantage of Intel Many Integrated Core Architecture (MIC)

BACKGROUND: Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPUs. An emerging alternative to the GPU is Intel MIC; supercomputers such as Tianhe-2, currently top of the TOP500, are built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, […]

Feb, 21

Effects of Easy Hybrid Parallelization with CUDA for Numerical-Atomic-Orbital Density Functional Theory Calculation

We modified an MPI-friendly density functional theory (DFT) source code using hybrid parallelization that includes CUDA. Our objective is to find out how simple conversions within the hybrid parallelization with mid-range GPUs affect DFT code not originally suited to CUDA. We established several rules of hybrid parallelization for numerical-atomic-orbital (NAO) DFT codes. The test was performed […]

CUDA

Feb, 19

X-ray CT on the GPU

Nondestructive testing (NDT) is a collection of analysis techniques used by scientists and technologists as a way of analyzing the interior of an object without damaging the object. Since the analysis is done without damaging the object, NDT is an extremely valuable technique used in various industries for troubleshooting and research. CNDE has a long […]

CUDA

Feb, 19

Large-Scale Geospatial Processing on Multi-Core and Many-Core Processors: Evaluations on CPUs, GPUs and MICs

Geospatial processing operations, such as queries based on point-to-polyline shortest distance and point-in-polygon tests, are fundamental to many scientific and engineering applications, such as post-processing large-scale environmental and climate model outputs and analyzing traffic and travel patterns from massive GPS collections in transportation engineering and urban studies. Commodity parallel hardware, such as multi-core CPUs, many-core GPUs […]

CUDA

Feb, 19

Using of GPUs for cluster analysis of large data by K-means method

This problem was solved within the framework of the grant project "Solving of problems of cluster analysis with application of parallel algorithms and cloud technologies" at the Institute of Mathematics and Mathematical Modelling in Almaty. The problem of cluster analysis for large amounts of data is very important in different areas of science – […]

CUDA

Feb, 19

Parallel algorithms for problems of cluster analysis with very large amount of data

In this paper we use GPUs to solve massive problems with large amounts of data that are not well suited to SIMD technology. For the given problem we consider a three-level parallelization. CPU multithreading is used at the top level, and graphics processors are used for massive computing. For solving problems of cluster analysis […]

CUDA

Feb, 19

Fast Hamiltonian Monte Carlo Using GPU Computing

In recent years, the Hamiltonian Monte Carlo (HMC) algorithm has been found to work more efficiently than other popular Markov Chain Monte Carlo (MCMC) methods (such as random walk Metropolis-Hastings) in generating samples from a posterior distribution. A general framework for HMC based on the use of graphics processing units (GPUs) is shown to […]

CUDA

Feb, 17

Towards a Performance-Portable FFT Library for Heterogeneous Computing

The fast Fourier transform (FFT), a spectral method that computes the discrete Fourier transform and its inverse, pervades many applications in digital signal processing, such as imaging, tomography, and software-defined radio. Its importance has caused the research community to expend significant resources to accelerate the FFT, of which FFTW is the most prominent example. With […]

OpenCL

Feb, 17

A Similarity-Based Analysis Tool for Scientific Application Porting

Porting applications to a new system is a nontrivial job in the HPC field. It is a very time-consuming, labor-intensive process, and the quality of the results will depend critically on the experience of the experts involved. In order to ease the porting process, we propose a methodology to address an important aspect of software […]

CUDA

Feb, 17

The battle of the giants: a case study of GPU vs FPGA optimisation for real-time image processing

This paper focuses on a thorough comparison of the two main hardware targets for real-time optimization of a computer vision algorithm: GPU and FPGA. Based on a complex case study algorithm for threaded isle detection, implementation on both hardware targets is compared in terms of resulting time performance, code translation effort, hardware cost, power efficiency […]

OpenCL

Feb, 17

Petascale elliptic solvers for anisotropic PDEs on GPU clusters

Memory-bound applications such as solvers for large sparse systems of equations remain a challenge for GPUs. Fast solvers should be based on numerically efficient algorithms and implemented such that global memory access is minimised. To solve systems with up to one trillion (10^12) unknowns, the code has to make efficient use of several million […]

CUDA

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
