high performance computing on graphics processing units: hgpu.org

Posts

Apr, 7

On Password Guessing with GPUs and FPGAs

Passwords are still by far the most widely used form of user authentication, for applications ranging from online banking or corporate network access to storage encryption. Password guessing thus poses a serious threat for a multitude of applications. Modern password hashes are specifically designed to slow down guessing attacks. However, having exact measures for the […]

CUDA

Apr, 4

OmpSs task offload

Exascale performance requires a level of energy efficiency only achievable with specialized hardware. Hence, to build a general purpose HPC system with exascale performance, different types of processors, memory technologies and interconnection networks will be necessary. Heterogeneous hardware is already present on some top supercomputer systems that are composed of different compute nodes, which at […]

CUDA • OpenCL

Apr, 4

Reduction of a Symmetrical Matrix to Tridiagonal Form on GPUs

Many eigenvalue and eigenvector algorithms begin by reducing the input matrix to tridiagonal form. A tridiagonal matrix is a matrix that has non-zero elements only on its main diagonal and the two diagonals directly adjacent to it. Reducing a matrix to tridiagonal form is an iterative process which uses Jacobi rotations to reduce […]
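As a small illustration of the definition in the excerpt above, the following sketch checks whether a matrix is tridiagonal. It is not taken from the paper; the function name `is_tridiagonal` and the sample matrices are hypothetical and chosen only for illustration.

```python
def is_tridiagonal(matrix):
    """Return True if every non-zero entry lies on the main diagonal
    or on one of the two diagonals directly adjacent to it."""
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            # Entries more than one position away from the main diagonal
            # must be zero in a tridiagonal matrix.
            if abs(i - j) > 1 and value != 0:
                return False
    return True


# Hypothetical sample data: a symmetric tridiagonal matrix and a dense one.
tridiagonal = [
    [4, 1, 0, 0],
    [1, 3, 2, 0],
    [0, 2, 5, 7],
    [0, 0, 7, 6],
]
dense = [
    [4, 1, 9, 0],
    [1, 3, 2, 0],
    [9, 2, 5, 7],
    [0, 0, 7, 6],
]

print(is_tridiagonal(tridiagonal))
print(is_tridiagonal(dense))
```

The check only tests the sparsity pattern; it says nothing about how the reduction itself is carried out on the GPU.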

CUDA

Apr, 4

An Effective Model of CPU/GPU Collaborative Computing in GPU Clusters

Remote procedure call (RPC) is a simple, transparent and useful paradigm for providing communication between two processes across a network. The compute unified device architecture (CUDA) programming toolkit and runtime enhance the programmability of the graphics processing unit (GPU) and make the GPU more versatile in high performance computing. Current research mainly focuses on the […]

CUDA

Apr, 4

The Design and Implementation of a Verification Technique for GPU Kernels

We present a technique for the formal verification of GPU kernels, addressing two classes of correctness properties: data races and barrier divergence. Our approach is founded on a novel formal operational semantics for GPU kernels termed synchronous, delayed visibility (SDV) semantics, which captures the execution of a GPU kernel by multiple groups of threads. The […]

CUDA • OpenCL

Apr, 4

Using OpenCL to Implement Median Filtering and RSA Algorithms: Two GPGPU Application Case Studies

Graphics Processing Units (GPUs) and their development tools have advanced recently, and industry has become more interested in using them. Among several development frameworks for GPUs, OpenCL provides a programming environment to write portable code that can run in parallel. This report describes two case studies of algorithm implementations in OpenCL. The first algorithm is […]

OpenCL

Apr, 1

Distributed wideband software-defined radio receiver for heterogeneous systems

Recent years have seen an increasing need for computationally efficient implementation of software-defined radio (SDR) systems. Given the limitations of a typical SDR application running on a single machine, we present a distributed SDR system using high-performance techniques. To split a digital signal into multiple channels, we use an efficient digital signal processing technique: a […]

OpenCL

Apr, 1

Generating Null Models for Large-Scale Networks on GPU

A network generated by randomly rewiring the edges of an original network under some constraint conditions is called the null model of the original network. It’s a useful tool for revealing some mechanisms affecting the topology of networks. As networks grow larger and larger, the time needed to generate null models increases. How […]
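The excerpt does not say which constraints the rewiring must respect. As a minimal sequential sketch, the code below assumes one common choice: every node keeps its degree, and no self-loops or duplicate edges are created. The function names `rewire_preserving_degrees` and `degree_counts`, the parameters, and the ring network are all hypothetical; this is not the paper's GPU method.

```python
import random
from collections import Counter


def rewire_preserving_degrees(edges, num_swaps, seed=0):
    """Randomly rewire an undirected simple graph by swapping edge endpoints.

    Assumed constraint (not stated in the excerpt): node degrees are kept,
    and no self-loops or duplicate edges are introduced.
    """
    rng = random.Random(seed)
    edge_list = [tuple(edge) for edge in edges]
    edge_set = {frozenset(edge) for edge in edge_list}
    done = 0
    attempts = 0
    max_attempts = 100 * num_swaps  # guard against graphs with no valid swap
    while done < num_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        # Proposed swap: (a, b), (c, d) -> (a, d), (c, b).
        if len({a, b, c, d}) < 4:
            continue  # a shared node would create a self-loop or a no-op
        new_first, new_second = frozenset((a, d)), frozenset((c, b))
        if new_first in edge_set or new_second in edge_set:
            continue  # would create a duplicate edge
        edge_set.discard(frozenset((a, b)))
        edge_set.discard(frozenset((c, d)))
        edge_set.update((new_first, new_second))
        edge_list[i] = (a, d)
        edge_list[j] = (c, b)
        done += 1
    return edge_list


def degree_counts(edges):
    """Count how many edges touch each node."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


# Hypothetical input: a ring of 8 nodes.
original = [(k, (k + 1) % 8) for k in range(8)]
null_model = rewire_preserving_degrees(original, num_swaps=10)
print(degree_counts(null_model) == degree_counts(original))
```

Each swap touches only two edges, which is why every node's degree is unchanged: each of the four endpoints loses one neighbour and gains one.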

Apr, 1

Microbranching in mode-I fracture using large scale simulations of amorphous and perturbed lattice models

We study the high-velocity regime mode-I fracture instability using large scale simulations. At large driving displacements, the pattern of a single, steady-state crack that propagates in the midline of the sample breaks down, and small microbranches start to appear near the main crack. Some of the features of those microbranches have been reproduced qualitatively in […]

CUDA

Apr, 1

Separable projection integrals for higher-order correlators of the cosmic microwave sky: Acceleration by factors exceeding 100

We study the optimisation and porting of the "Modal" code on Intel(R) Xeon(R) processors and/or Intel(R) Xeon Phi(TM) coprocessors using methods which should be applicable to more general compute bound codes. "Modal" is used by the Planck satellite experiment for constraining general non-Gaussian models of the early universe via the bispectrum of the cosmic microwave […]

Apr, 1

Parameter Selection and Pre-Conditioning for a Graph Form Solver

In a recent paper, Parikh and Boyd describe a method for solving a convex optimization problem, where each iteration involves evaluating a proximal operator and projection onto a subspace. In this paper we address the critical practical issues of how to select the proximal parameter in each iteration, and how to scale the original problem […]

CUDA

Mar, 30

Massively Parallel Analysis of Similarity Matrices on Heterogeneous Hardware

We conduct a study that investigates the performance characteristics of a set of parallel implementations of the recurrence quantification analysis (RQA) using OpenCL. Being an important tool in climate impact and medical research, a central aspect of RQA is the construction of a binary matrix that captures the similarities of multi-dimensional vectors. Based on this […]
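The binary similarity matrix mentioned in the excerpt can be sketched as a recurrence matrix in which entry (i, j) is 1 when vectors i and j are closer than a threshold. The Euclidean distance, the threshold name `epsilon`, and the sample vectors are assumptions made for illustration; the paper's OpenCL implementations are not reproduced here.

```python
import math


def recurrence_matrix(vectors, epsilon):
    """Build a binary matrix whose entry (i, j) is 1 when vectors i and j
    are within distance epsilon of each other, and 0 otherwise.

    Assumption: similarity is measured with the Euclidean distance.
    """
    n = len(vectors)
    return [
        [1 if math.dist(vectors[i], vectors[j]) <= epsilon else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical two-dimensional sample vectors.
samples = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
for row in recurrence_matrix(samples, epsilon=0.5):
    print(row)
```

Every entry can be computed independently of the others, which is what makes this construction a natural fit for massively parallel hardware.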

OpenCL


Recent source codes

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
