high performance computing on graphics processing units: hgpu.org

Posts

Jun, 17

Exploring the Suitability of Remote GPGPU Virtualization for the OpenACC Programming Model Using rCUDA

OpenACC is an application programming interface (API) that aims to unleash the power of heterogeneous systems composed of CPUs and accelerators such as graphics processing units (GPUs) or Intel Xeon Phi coprocessors. This directive-based programming model is intended to enable developers to accelerate their application’s execution with much less effort. Coprocessors offer significant computing power […]

CUDA

Jun, 17

GPU-Enabled Particle-Particle Particle-Tree Scheme for Simulating Dense Stellar Cluster System

We describe the implementation and performance of the P^3T (Particle-Particle Particle-Tree) scheme for simulating dense stellar systems. In P^3T, the force experienced by a particle is split into short-range and long-range contributions. Short-range forces are evaluated by direct summation and integrated with the fourth-order Hermite predictor-corrector method with block timesteps. For long-range forces, […]

CUDA

Jun, 17

Automatic Data Layout Optimizations for GPUs

Memory optimizations have become increasingly important in order to fully exploit the computational power of modern GPUs. The data arrangement has a big impact on performance, and it is very hard for GPU programmers to identify a well-suited data layout. Classical data layout transformations include grouping together data fields that have similar access patterns, […]

OpenCL

Jun, 17

Layered Interpretation of Street View Images

We propose a layered street view model to encode both depth and semantic information on street view images for autonomous driving. Recently, stixels, stix-mantics, and tiered scene labeling methods have been proposed to model street view images. We propose a 4-layer street view model, a compact representation over the recently proposed stix-mantics model. Our layers […]

CUDA

Jun, 16

Perfect Hashing Structures for Parallel Similarity Searches

Seed-based heuristics have proved to be efficient for studying similarity between genetic databases with billions of base pairs. This paper focuses on algorithms and data structures for the filtering phase in seed-based heuristics, with an emphasis on efficient parallel GPU/manycores implementation. We propose a 2-stage index structure which is based on neighborhood indexing and perfect […]

OpenCL

Jun, 16

Falcon: A Graph Manipulation Language for Heterogeneous Systems

Graph algorithms are used in several domains such as social networking, biological sciences, computational geometry, and compilers, to name a few. It has been shown that they possess enough parallelism to keep several computing resources busy – even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware […]

CUDA

Jun, 16

Characterizing Dataset Dependence for Sparse Matrix-Vector Multiplication on GPUs

Sparse matrix-vector multiplication (SpMV) is a widely used kernel in scientific applications as well as data analytics. Many GPU implementations of SpMV have been proposed, using different sparse matrix representations. However, no sparse matrix representation is consistently superior, and the best representation varies for sparse matrices with different sparsity patterns. In this paper we study […]

CUDA

Jun, 16

GPU Predictor-Corrector Interior Point Method for Large-Scale Linear Programming

This master’s thesis concerns the implementation of a GPU-accelerated version of Mehrotra’s predictor-corrector interior point algorithm for large-scale linear programming (LP). The implementations are tested on LP problems arising in the financial industry, where there is high demand for faster LP solvers. The algorithm was implemented in C++, MATLAB and CUDA, using double precision for […]

CUDA

Jun, 16

Parallelization of DIRA and CTmod using OpenMP and OpenCL

Parallelization answers the ever-growing demand for computing power by taking advantage of multi-core processor technology and modern many-core graphics compute units. Multi-core CPUs and many-core GPUs have the potential to substantially reduce the execution time of a program, but it is often a challenging task to ensure that all available hardware is […]

OpenCL

Jun, 14

Type-safe Runtime Code Generation: Accelerate to LLVM

Embedded languages are often compiled at application runtime; thus, embedded compile-time errors become application runtime errors. We argue that advanced type system features, such as GADTs and type families, play a crucial role in minimising such runtime errors. Specifically, a rigorous type discipline reduces runtime errors due to bugs in both embedded language applications and […]

CUDA

Jun, 14

Automatic Selection of Sparse Matrix Representation on GPUs

Sparse matrix-vector multiplication (SpMV) is a core kernel in numerous applications, ranging from physics simulation and large-scale solvers to data analytics. Many GPU implementations of SpMV have been proposed, targeting several sparse representations and aiming at maximizing overall performance. No single sparse matrix representation is uniformly superior, and the best performing representation varies for sparse […]

CUDA

Jun, 14

A GPU vs CPU performance evaluation of an experimental video compression algorithm

Modern video compression algorithms put significant strain on a system’s CPU, especially for video encoding. The ever increasing demands for using video compression algorithms in a wide range of applications necessitate the use of processing components that boost the speed and quality of the video compression algorithm’s execution. The vast parallel computational capabilities of modern […]

CUDA
