high performance computing on graphics processing units: hgpu.org

Posts

Feb, 3

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

The QR decomposition with column pivoting (QRP) of a matrix is widely used for numerical rank revealing in applications. The performance of the LAPACK implementation (DGEQP3) of the Householder QRP algorithm is limited by the Level 2 BLAS operations required for updating the column norms. In this paper, we propose an implementation of the QRP algorithm using […]

CUDA

Feb, 3

Hybrid CPU-GPU Distributed Framework for Large Scale Mobile Networks Simulation

Most existing packet-level simulation tools are designed to perform experiments modeling small to medium scale networks. The main reason for this limitation is the amount of computation power and memory available in a quasi mono-process simulation environment. To enable efficient packet-level simulation for large scale scenarios, we introduce a new CPU-GPU co-simulation framework […]

CUDA

Feb, 3

JPEG 2000 Wireless Image Transmission System using Encryption Domain Authentication

In this paper, we propose a wireless high resolution video transmission system with encryption and authentication. The proposed system is implemented with JPEG 2000 coding. We implement the JPEG 2000 coder either on a GPU using CUDA, which is an integrated development environment for GPUs, or with a JPEG 2000 codec LSI. Moreover, the authentication system can check the […]

CUDA

Feb, 3

Fast and Maliciously Secure Two-Party Computation Using the GPU

We describe, and implement, a maliciously secure protocol for secure two-party computation, based on Yao’s garbled circuit and an efficient OT extension, in a parallel computational model. The implementation is done using CUDA and yields the fastest results for maliciously secure two-party computation in a realistic and practical setting by using a simple consumer grade […]

CUDA

Feb, 2

Software Reliability Enhancements for GPU Applications

As the role of highly parallel accelerators becomes more important in high performance computing, so does the need to ensure their reliable operation. In applications where precision and correctness are a necessity, bit-level reliable operation is required. While there exist mechanisms for error detection and correction, their cost-effective implementation in massively parallel accelerators is still an […]

CUDA

Feb, 2

Portable Performance on Heterogeneous Architectures

Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from […]

CUDA • OpenCL

Feb, 2

Heterogeneous GPU and CPU acceleration of a finite volume compressible flow solver for multiblock structured grids

The main objective of this project is to investigate the application of heterogeneous acceleration to a finite volume compressible flow solver for multiblock structured grids. Provided as Fortran source code, the ROTORMBMGS computational fluid dynamics program currently uses domain decomposition and message passing to distribute computation across multiple computers. Winning awards for scaling performance, there is […]

OpenCL

Feb, 2

Improving GPGPU Concurrency with Elastic Kernels

Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 […]

CUDA

Feb, 2

XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures

Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators, like GPUs. Programming such nodes is typically based on a combination of OpenMP and CUDA/OpenCL codes; scheduling relies on a static partitioning and cost model. We present the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports […]

CUDA

Feb, 1

Embedding OpenCL in GHC Haskell

OpenCL defines a computation model for data-parallel code, supporting compilation to a variety of platforms, including both conventional x86 CPUs and commodity graphics hardware. OpenCL consists of both a programming language for writing data parallel code, called kernels, and an API, written in C, for interacting with the OpenCL platform and invoking OpenCL kernels. We […]

OpenCL

Feb, 1

Efficient Exploitation of Heterogeneous Platforms for Vertebra Detection in X-Ray Images

Back problems are often related to an abnormal condition of the spine. In this context, conventional X-Ray radiography is the most common modality used in emergency rooms since it is relatively inexpensive and fast. In this paper, we are interested in a method for detecting and extracting vertebrae on X-Ray images. In a medical context, […]

CUDA

Jan, 31

Validation of the PyGBe code for Poisson-Boltzmann equation with boundary element methods

The PyGBe code solves the linearized Poisson-Boltzmann equation using a boundary-integral formulation. We use a boundary element method with a collocation approach, and solve it via a Krylov-subspace method. To do this efficiently, the matrix-vector multiplications in the Krylov iterations are accelerated with a treecode, achieving O(N log N) complexity. The code presents a Python […]

CUDA

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container
