high performance computing on graphics processing units: hgpu.org

Posts

Mar, 16

Developing a CUDA solver for large sparse matrices for MARIN

This master's thesis has been written for the degree of Master of Science in Applied Mathematics at the faculty of Electrical Engineering, Mathematics and Computer Sciences of Delft University of Technology. The report ends a nine-month internship carried out at the Maritime Research Institute Netherlands (MARIN). MARIN supplies innovative products for the offshore industry and […]

CUDA

Mar, 16

Multi-platform Linear Algebra

HiFlow3 is a multi-purpose finite element software providing powerful tools for efficient and accurate solution of a wide range of problems modeled by partial differential equations (PDEs). Based on object-oriented concepts and the full capabilities of C++ the HiFlow3 project follows a modular and generic approach for building efficient parallel numerical solvers. It provides highly […]

CUDA

Mar, 15

On the Use of Small 2D Convolutions on GPUs

Computing many small 2D convolutions using FFTs is a basis for a large number of applications in many domains in science and engineering, among them electromagnetic diffraction modeling in physics. The GPU architecture seems to be a suitable architecture to accelerate these convolutions, but reaching high application performance requires substantial development time and non-portable optimizations. […]

CUDA

Mar, 15

Iterative Statistical Kernels on Contemporary GPUs

We present a study of three important kernels that occur frequently in iterative statistical applications: Multi-Dimensional Scaling (MDS), PageRank, and K-Means. We implemented each kernel using OpenCL and evaluated their performance on NVIDIA Tesla and NVIDIA Fermi GPGPU cards using dedicated hardware, and in the case of Fermi, also on the Amazon EC2 cloud-computing environment. […]

OpenCL

Mar, 15

Performance analysis and optimization of the OP2 framework on many-core architectures

This paper presents a benchmarking, performance analysis and optimization study of the OP2 ‘active’ library, which provides an abstraction framework for the parallel execution of unstructured mesh applications. OP2 aims to decouple the scientific specification of the application from its parallel implementation, and thereby achieve code longevity and near-optimal performance through re-targeting the application to […]

CUDA

Mar, 15

Compressed Multiple-Row Storage Format

A new format for storing sparse matrices is proposed for efficient sparse matrix-vector (SpMV) product calculation on modern throughput-oriented computer architectures. This format extends the standard compressed row storage (CRS) format and is easily convertible to and from it without any memory overhead. Computational performance of an SpMV kernel for the new format is determined […]

CUDA

Mar, 15

A Spiking Neural P system simulator based on CUDA

In this paper we present a Spiking Neural P system (SNP system) simulator based on graphics processing units (GPUs). In particular we implement the simulator using NVIDIA CUDA enabled GPUs. The massively parallel architecture of current GPUs is very suitable for the maximally parallel computations of SNP systems. We simulate a wider variety of SNP […]

CUDA

Mar, 13

Targeting heterogeneous architectures via macro data flow

We propose a data flow based run time system as an efficient tool for supporting execution of parallel code on heterogeneous architectures hosting both multicore CPUs and GPUs. We discuss how the proposed run time system may be the target of both structured parallel applications developed using algorithmic skeletons/parallel design patterns and also more "domain […]

Mar, 13

Expressive Array Constructs in an Embedded GPU Kernel Programming Language

Graphics Processing Units (GPUs) are powerful computing devices that, with the advent of CUDA/OpenCL, are becoming useful for general-purpose computations. Obsidian is an embedded domain-specific language that generates CUDA kernels from functional descriptions. A symbolic array construction allows us to guarantee that intermediate arrays are fused away. However, the current array construction has […]

CUDA

OpenCL

Mar, 13

Parallel Branch and Bound on a CPU-GPU System

A hybrid implementation via CUDA of a branch and bound method for knapsack problems is proposed. Branch and bound computations can be carried out either on the CPU or on the GPU according to the size of the branch and bound list, i.e. the number of nodes. Tests are carried out on a Tesla C2050 GPU. […]

CUDA

Mar, 13

Analyzing CUDA’s Compiler through the Visualization of Decoded GPU Binaries

With GPU architectures becoming increasingly important due to their large number of parallel processors, NVIDIA’s CUDA environment is becoming widely used to support general purpose applications. To efficiently use the parallel processing power, programmers need to efficiently parallelize and map their algorithms. The difficulty of this task leads to the idea to investigate CUDA’s compiler. […]

CUDA

Mar, 13

Real-time execution of image change detection

State-of-the-art video analysis systems feature multiple complex processing steps and operate on high resolution images. Intensive computation power is needed for real-time execution. In this project an image change detection application is mapped to a heterogeneous multicore CPU/GPU platform. It is investigated what hardware configuration is required to execute the application in real-time. For optimal […]

CUDA

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container