high performance computing on graphics processing units: hgpu.org

Posts

Jan, 25

Orthogonalization on a General Purpose Graphics Processing Unit with Double Double and Quad Double Arithmetic

Our problem is to accurately solve linear systems on a general purpose graphics processing unit with double double and quad double arithmetic. The linear systems originate from the application of Newton’s method on polynomial systems. Newton’s method is applied as a corrector in a path following method, so the linear systems are solved in sequence […]

CUDA

Jan, 25

Regularization and nonlinearities for neural language models: when are they needed?

We show that a recently proposed regularization method called random dropouts works well for language models based on neural networks when little training data is available. Random dropout regularization involves adding a certain kind of noise to the likelihood function being optimized and can be interpreted as a variational approximation to a new class of […]

CUDA

Jan, 25

Locality-Aware Work Stealing on Multi-CPU and Multi-GPU Architectures

Most recent HPC platforms have heterogeneous nodes composed of a combination of multi-core CPUs and accelerators, like GPUs. Scheduling on such architectures relies on a static partitioning and cost model. In this paper, we present a locality-aware work stealing scheduler for multi-CPU and multi-GPU architectures, which relies on the XKaapi runtime system. We show performance […]

CUDA

Jan, 25

Vlasov on GPU (VOG Project)

This work concerns the numerical simulation of the Vlasov-Poisson set of equations using semi-Lagrangian methods on Graphics Processing Units (GPUs). To accomplish this goal, modifications to traditional methods had to be implemented. First and foremost, a reformulation of semi-Lagrangian methods is performed, which enables us to rewrite the governing equations as a circulant matrix […]

OpenCL

Jan, 25

A GPU-accelerated Direct-sum Boundary Integral Poisson-Boltzmann Solver

In this paper, we present a GPU-accelerated direct-sum boundary integral method to solve the linear Poisson-Boltzmann (PB) equation. In our method, a well-posed boundary integral formulation is used to ensure the fast convergence of Krylov-subspace-based linear algebraic solvers such as GMRES. The molecular surfaces are discretized with flat triangles and centroid collocation. […]

CUDA

Jan, 24

High Performance Lattice Boltzmann Solvers on Massively Parallel Architectures with Applications to Building Aeraulics

With the advent of low-energy buildings, the need for accurate building performance simulations has significantly increased. However, for the time being, the thermo-aeraulic effects are often taken into account through simplified or even empirical models, which fail to provide the expected accuracy. Resorting to computational fluid dynamics seems therefore unavoidable, but the required computational effort […]

CUDA

Jan, 24

Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems

We discuss three complementary approaches that can provide both portability and an increased level of abstraction for the programming of heterogeneous multicore systems. Together, these approaches also support performance portability, as currently investigated in the EU FP7 project PEPPHER. In particular, we consider (1) a library-based approach, here represented by the integration of the SkePU […]

CUDA

Jan, 24

Hybrid Single/Double Precision Floating-Point Computation on GPU Accelerators for 2-D FDTD

Acceleration of FDTD (Finite-Difference Time-Domain) is very important in computational electromagnetics. We propose a hybrid single/double precision floating-point computation to accelerate FDTD on GPUs. We apply single precision when the dynamic range of the electromagnetic field is low and double precision when the dynamic range is high. According to the experimental results, we achieved over 35 times […]

CUDA

Jan, 24

Developing and Evaluating clOpenCL Applications for Heterogeneous Clusters

In the last few years, the processing capabilities of computing systems have increased significantly, changing from single-core to multi-core and even many-core systems. Accompanying this evolution, local networks have also become faster, with multi-gigabit technologies like Infiniband, Myrinet and 10G Ethernet. Parallel/distributed programming tools and standards, like POSIX Threads, OpenMP and MPI, have helped to explore […]

CUDA • OpenCL

Jan, 24

Performance Study of Satellite Image Processing on Graphics Processors Unit Using CUDA

High-resolution satellite images are now widely used for a variety of mapping applications including photogrammetry, GIS data acquisition and visualization. As the spectral and spatial data size of satellite images increases, greater processing power is needed to process the images. Parallel systems are the solution to this problem. Parallel processing techniques have been […]

CUDA

Jan, 24

GPU-based 3D Wavelet Transform

A wide range of applications, such as volumetric medical data compression, video watermarking and video coding, use the three-dimensional wavelet transform (3D-DWT) in their algorithms. In this work, we present GPU algorithms, based on both global and shared memory, to compute the 3D-DWT on both the GTX280 and the GMT540 platforms. The results obtained show that […]

CUDA

Jan, 23

Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization

GPUs are seeing increasingly widespread use for general purpose computation due to their excellent performance for highly-parallel, throughput-oriented applications. For many workloads, however, the performance benefits of offloading are hindered by the large and unpredictable overheads of launching GPU kernels and of transferring data between CPU and GPU. This paper proposes and evaluates hardware and […]

CUDA


Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)