
Posts

Sep, 19

Automatic OpenCL code generation for multi-device heterogeneous architectures

Using multiple accelerators, such as GPUs or Xeon Phis, is attractive for improving the performance of large data-parallel applications and for increasing the size of their workloads. However, writing an application for multiple accelerators remains challenging today, because going from a single accelerator to multiple ones requires dealing with potentially non-uniform domain […]
Sep, 19

Automatic Online Tuning (AutoTune): Fully Extended Analysis

The AutoTune project develops the Periscope Tuning Framework (PTF), which includes several plugins targeting performance improvements as well as reduced energy consumption of applications. One of the main advantages of PTF over other tuning frameworks is its capability to combine tuning and analysis strategies to simplify and speed up the tuning process. To support the […]
Sep, 19

Parallel Decompression of Seismic Data on GPU Using a Lifting Wavelet Algorithm

Subsurface images are widely used by oil companies to find oil reservoirs. Constructing these images involves collecting and processing a huge amount of seismic data. Oil companies generally use compression algorithms to reduce storage and transmission costs. Currently, the compression process is carried out on-site using CPU architectures, whereas the […]
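
The post's title refers to a lifting wavelet algorithm. As a minimal, hedged illustration of the lifting idea (split, predict, update), and not the authors' GPU decompression kernel or their specific wavelet filter, here is a single-level Haar-style lifting transform and its inverse in plain Python:

```python
def haar_lift_forward(signal):
    """One level of a Haar-style lifting transform (illustrative only).

    Split the signal into even and odd samples, predict the odd samples
    from the even ones (detail coefficients), then update the even
    samples to carry the local average (approximation coefficients).
    """
    assert len(signal) % 2 == 0, "example assumes an even-length signal"
    even = signal[0::2]
    odd = signal[1::2]
    detail = [o - e for e, o in zip(even, odd)]            # predict step
    approx = [e + d / 2.0 for e, d in zip(even, detail)]   # update step
    return approx, detail


def haar_lift_inverse(approx, detail):
    """Invert the lifting steps in reverse order."""
    even = [a - d / 2.0 for a, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    signal = []
    for e, o in zip(even, odd):
        signal.extend([e, o])
    return signal


if __name__ == "__main__":
    x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
    a, d = haar_lift_forward(x)
    assert haar_lift_inverse(a, d) == x
```
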
Sep, 19

Autotuning Wavefront Patterns for Heterogeneous Architectures

Manual tuning of applications for heterogeneous parallel systems is tedious and complex. Optimizations are often not portable, and the whole process must be repeated when moving to a new system, or sometimes even to a different problem size. Pattern-based parallel programming models were originally designed to provide programmers with an abstraction layer, hiding tedious […]
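
For context on the pattern being tuned: a wavefront computation sweeps a 2-D grid along anti-diagonals, because each cell depends on its north and west neighbours, and all cells on one anti-diagonal are independent and can run in parallel. The sketch below is a plain sequential Python illustration of that traversal; the tunable parameters studied in the paper (such as tile sizes) are not modelled here.

```python
def wavefront(n, m, cost):
    """Fill an n x m table where each cell depends on its north and
    west neighbours, visiting cells anti-diagonal by anti-diagonal.

    Cells on the same anti-diagonal (i + j == d) are independent of
    each other, which is what a parallel or tuned implementation exploits.
    """
    table = [[0] * m for _ in range(n)]
    for d in range(n + m - 1):                 # one anti-diagonal at a time
        for i in range(max(0, d - m + 1), min(n, d + 1)):
            j = d - i
            north = table[i - 1][j] if i > 0 else 0
            west = table[i][j - 1] if j > 0 else 0
            table[i][j] = cost(i, j) + max(north, west)
    return table


if __name__ == "__main__":
    # Toy cost function; a real use case would be e.g. sequence alignment.
    result = wavefront(4, 5, lambda i, j: (i + j) % 3)
    print(result[-1][-1])
```
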
Sep, 19

An OpenCL design of the Bob Jenkins lookup3 hash function using the Xilinx SDAccel Development Environment

In this report, we present an OpenCL-based design of a hashing function that forms a core component of memcached [1], a distributed in-memory key-value caching layer widely used to reduce the access load between web servers and databases. Our work has been inspired by recent research investigations on dataflow architectures for key-value stores that can […]
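
In a memcached-style caching layer, the hash maps a key to a bucket or back-end server; the report's contribution is an OpenCL/FPGA design of the hashing step itself. The sketch below only illustrates that surrounding use, with a stock hash standing in for Jenkins' lookup3 and made-up server addresses:

```python
import hashlib


def server_for_key(key, servers):
    """Map a cache key to one of the back-end servers (illustrative only).

    A memcached-style layer hashes the key and uses the digest to pick a
    bucket; the report's OpenCL design accelerates the hashing step itself.
    Here a stock hash stands in for lookup3.
    """
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[bucket]


if __name__ == "__main__":
    # Hypothetical server addresses, for illustration only.
    servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
    for key in ("user:42", "session:abc", "page:/index"):
        print(key, "->", server_for_key(key, servers))
```
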
Sep, 17

Efficient Kernel Fusion Techniques for Massive Video Data Analysis on GPGPUs

Kernels are executable code segments, and kernel fusion is a technique for combining the segments in a coherent manner to improve execution time. For the first time, we have developed a technique to fuse image-processing kernels to be executed on GPGPUs, improving execution time and total throughput (the amount of data processed in unit […]
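
As a hedged, CPU-side illustration of the kernel-fusion idea (the paper's kernels are GPGPU image-processing kernels): two separate passes each traverse the whole image and the first one materialises an intermediate buffer, whereas the fused version applies both operations in a single pass per pixel. The operations below (brighten, then threshold) are made up for illustration:

```python
def unfused(pixels, brightness, threshold):
    """Two separate 'kernels': each traverses the whole image and the
    first one materialises an intermediate buffer."""
    brightened = [min(255, p + brightness) for p in pixels]    # kernel 1
    return [255 if p >= threshold else 0 for p in brightened]  # kernel 2


def fused(pixels, brightness, threshold):
    """Fused 'kernel': one traversal, no intermediate buffer."""
    return [255 if min(255, p + brightness) >= threshold else 0
            for p in pixels]


if __name__ == "__main__":
    image = [10, 120, 200, 250, 90]
    assert unfused(image, 40, 128) == fused(image, 40, 128)
```
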
Sep, 17

SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration

Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data-transfer management. This work distribution can be a poor solution, as it […]
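
The underlying idea is to split one data-parallel kernel's index range across the CPU and GPU instead of assigning them fixed roles. The sketch below shows a proportional split of an iteration space; the device names and throughput numbers are hypothetical, and SKMD itself performs this partitioning transparently for real kernels:

```python
def partition_range(n_items, device_throughputs):
    """Split [0, n_items) into contiguous chunks proportional to each
    device's measured (or estimated) throughput."""
    total = sum(device_throughputs.values())
    chunks, start = {}, 0
    devices = list(device_throughputs)
    for k, dev in enumerate(devices):
        if k == len(devices) - 1:
            end = n_items            # last device takes the remainder
        else:
            end = start + round(n_items * device_throughputs[dev] / total)
        chunks[dev] = (start, end)
        start = end
    return chunks


if __name__ == "__main__":
    # Hypothetical relative throughputs (items per ms) for one kernel.
    print(partition_range(1_000_000, {"cpu": 120.0, "gpu": 880.0}))
```
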
Sep, 17

CLTune: A Generic Auto-Tuner for OpenCL Kernels

This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel performance over a generic, user-defined search space of possible parameter-value combinations. Example parameters include the OpenCL workgroup size, vector data types, tile sizes, and loop unrolling factors. CLTune can be used in the following scenarios: 1) when there are too many tunable […]
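
As a hedged sketch of what a user-defined search space of parameter-value combinations means in practice (this is not CLTune's actual C++ API): enumerate the Cartesian product of the tunable parameters, time the kernel for each legal combination, and keep the fastest. A stub cost function stands in for an OpenCL kernel launch:

```python
import itertools


def exhaustive_tune(parameters, time_kernel, is_legal=lambda cfg: True):
    """Brute-force search over all parameter-value combinations.

    parameters  : dict mapping parameter name -> list of candidate values
    time_kernel : callable(config) -> runtime in ms (would launch the
                  OpenCL kernel in a real tuner)
    is_legal    : optional constraint filter, e.g. work-group size limits
    """
    best_cfg, best_time = None, float("inf")
    names = list(parameters)
    for values in itertools.product(*(parameters[n] for n in names)):
        cfg = dict(zip(names, values))
        if not is_legal(cfg):
            continue
        t = time_kernel(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time


if __name__ == "__main__":
    space = {"WORKGROUP_X": [8, 16, 32], "TILE": [1, 2, 4], "UNROLL": [1, 4]}
    # Stub cost model in place of a real kernel launch.
    fake_time = lambda c: 100.0 / (c["WORKGROUP_X"] * c["TILE"]) + c["UNROLL"]
    print(exhaustive_tune(space, fake_time))
```
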
Sep, 17

Scalable Metropolis Monte Carlo for simulation of hard shapes

We design and implement HPMC, a scalable hard-particle Monte Carlo simulation toolkit, and release it as open source as part of HOOMD-blue. HPMC runs in parallel on many CPUs and many GPUs using domain decomposition. On the CPU, we employ BVH trees instead of cell lists for fast performance, especially with large particle-size disparity, […]
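
For context on the basic move type: a hard-particle Metropolis step proposes a small random displacement and accepts it only if the particle overlaps nothing, since an overlap has infinite energy and is always rejected. The sketch below is a brute-force 2-D hard-disk version in plain Python, without the cell lists, BVH trees, or domain decomposition that make HPMC scalable:

```python
import random


def overlaps(pos, others, diameter):
    """Brute-force hard-disk overlap test (HPMC uses cell lists / BVH trees)."""
    d2 = diameter * diameter
    return any((pos[0] - q[0]) ** 2 + (pos[1] - q[1]) ** 2 < d2 for q in others)


def sweep(positions, diameter=1.0, max_step=0.1, rng=random):
    """One Metropolis sweep: one trial displacement per disk, accepted only
    if it creates no overlap (hard interactions: accept/reject is 0 or 1)."""
    accepted = 0
    for i, (x, y) in enumerate(positions):
        trial = (x + rng.uniform(-max_step, max_step),
                 y + rng.uniform(-max_step, max_step))
        others = positions[:i] + positions[i + 1:]
        if not overlaps(trial, others, diameter):
            positions[i] = trial
            accepted += 1
    return accepted / len(positions)


if __name__ == "__main__":
    # Dilute start on a coarse grid so the initial state has no overlaps.
    disks = [(3.0 * i, 3.0 * j) for i in range(5) for j in range(5)]
    print("acceptance ratio:", sweep(disks))
```
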
Sep, 17

gSLICr: SLIC superpixels at over 250Hz

We introduce a parallel GPU implementation of Simple Linear Iterative Clustering (SLIC) superpixel segmentation. Using a single graphics card, our implementation achieves speedups of up to 83x over the standard sequential implementation. Our implementation is fully compatible with the standard sequential implementation, and the software is now available online as open source.
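
For reference, the core of SLIC is a k-means-like assignment in which each pixel is compared only against nearby cluster centres using a distance that combines colour and spatial terms, D = sqrt(d_c^2 + (d_s/S)^2 * m^2), with grid interval S and compactness m. Below is a scalar sketch of that distance in Python (the GPU implementation parallelises the assignment over pixels); the numbers are made up:

```python
import math


def slic_distance(pixel, center, grid_interval, compactness):
    """SLIC-style distance between a pixel and a cluster centre.

    pixel / center are (l, a, b, x, y) tuples; the colour distance d_c and
    the spatial distance d_s are combined as
        D = sqrt(d_c^2 + (d_s / S)^2 * m^2)
    where S is the superpixel grid interval and m the compactness weight.
    """
    dc2 = sum((p - c) ** 2 for p, c in zip(pixel[:3], center[:3]))
    ds2 = sum((p - c) ** 2 for p, c in zip(pixel[3:], center[3:]))
    return math.sqrt(dc2 + (ds2 / grid_interval ** 2) * compactness ** 2)


if __name__ == "__main__":
    px = (52.0, 10.0, -4.0, 120.0, 80.0)      # (l, a, b, x, y)
    c1 = (50.0, 12.0, -6.0, 110.0, 75.0)
    c2 = (80.0, -3.0, 2.0, 118.0, 82.0)
    best = min((c1, c2), key=lambda c: slic_distance(px, c, 20, 10))
    print("assigned to:", best)
```
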
Sep, 15

linalg: Matrix Computations in Apache Spark

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark comes with the mllib.linalg library, which provides abstractions and implementations for distributed matrices. Using these abstractions, we highlight the computations that were more challenging to distribute. When translating single-node algorithms to run on a distributed cluster, we observe […]
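
A minimal PySpark sketch of the distributed-matrix abstractions the paper describes, assuming pyspark is installed and that RowMatrix and its computeGramianMatrix method are available in your Spark version:

```python
# Assumes pyspark.mllib.linalg.distributed.RowMatrix and its
# computeGramianMatrix method exist in the installed Spark version.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("linalg-sketch").getOrCreate()

# Each RDD element is one row of a tall-and-skinny distributed matrix A.
rows = spark.sparkContext.parallelize([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
])
A = RowMatrix(rows)

print(A.numRows(), A.numCols())     # 3 x 3 in this toy example
gram = A.computeGramianMatrix()     # A^T A, gathered to the driver
print(gram)

spark.stop()
```
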
Sep, 15

Refinements in Syntactic Parsing

Syntactic parsing is one of the core tasks of natural language processing, with many applications in downstream NLP tasks, from machine translation and summarization to relation extraction and coreference resolution. Parsing performance on English texts, particularly well-edited newswire text, is generally regarded as quite good. However, state-of-the-art constituency parsers produce incorrect parses for more […]


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org