high performance computing on graphics processing units: hgpu.org

Posts

Jan, 19

GPU Computing for Meshfree Particle Method

Graphics Processing Units (GPUs), originally developed for computer games, now provide computational power for scientific applications. This study compares the computational speed-up and efficiency of a GPU with those of a CPU for the Finite Pointset Method (FPM), a numerical tool in Computational Fluid Dynamics (CFD). As FPM is based on […]

CUDA

Jan, 18

High-performance and Embedded Systems for Cryptography

This thesis addresses the design of cryptographic accelerators, ranging from the embedded system to the high-performance computing device. New techniques are proposed to allow several cryptographic algorithms to be computed by the same target. Therefore, flexibility (to support several algorithms) and scalability (to extend the features of a designed accelerator) are two keywords in all […]

OpenCL

Jan, 18

Supporting x86-64 Address Translation for 100s of GPU Lanes

Efficient memory sharing between CPU and GPU threads can greatly expand the effective set of GPGPU workloads. For increased programmability, this memory should be uniformly virtualized, necessitating compatible address translation support for GPU memory references. However, even a modest GPU might need 100s of translations per cycle (6 CUs * 64 lanes/CU) with memory access […]

CUDA
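
The figure in parentheses in the abstract above is a simple product, and a minimal sketch makes the arithmetic explicit. It assumes the worst case in which every lane issues one memory reference in the same cycle; the function name is hypothetical and not taken from the paper.

```python
def peak_translations_per_cycle(compute_units, lanes_per_cu):
    """Worst-case address translations per cycle.

    Assumption: every lane of every compute unit (CU) issues one
    memory reference in the same cycle, each needing a translation.
    """
    return compute_units * lanes_per_cu


# The "modest GPU" from the abstract: 6 CUs with 64 lanes each.
print(peak_translations_per_cycle(6, 64))  # 384
```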

Jan, 18

Improving the Performance of CA-GMRES on Multicores with Multiple GPUs

The Generalized Minimum Residual (GMRES) method is one of the most widely used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because, compared to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming […]

CUDA

Jan, 18

A GPU-based Multi-level Subspace Decomposition Scheme for Hierarchical Tensor Product Bases

The aim of this thesis is to implement a multi-level splitting of full grids on the GPU, which could be used in the incremental visualization of scientific data sets. The splitting is motivated by the approximation properties of the sparse grid technique. Looking towards large amounts of data, ideas of parallelization and data slicing are […]

CUDA

Jan, 18

Computing Spatial Distance Histograms for Large Scientific Datasets On-the-Fly

This paper focuses on an important query in scientific simulation data analysis: the Spatial Distance Histogram (SDH). The computation time of an SDH query using a brute-force method is quadratic. Often, such queries are executed continuously over certain time periods, increasing the computation time. We propose a highly efficient approximate algorithm to compute SDH over consecutive […]

CUDA
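
To illustrate why the brute-force SDH query mentioned above is quadratic, here is a minimal sketch of that baseline. It is not the paper's approximate algorithm. The function name, the fixed bucket width, and the choice to clamp large distances into the last bucket are assumptions made for illustration.

```python
import math
from itertools import combinations


def sdh_brute_force(points, bucket_width, num_buckets):
    """Histogram of all pairwise distances between points.

    Visits every unordered pair once, so the work grows as
    n * (n - 1) / 2, i.e. quadratically in the number of points.
    """
    hist = [0] * num_buckets
    for p, q in combinations(points, 2):
        d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        # Assumption: distances beyond the last bucket are clamped into it.
        idx = min(int(d // bucket_width), num_buckets - 1)
        hist[idx] += 1
    return hist


# Three points give three pairs with distances 1, 2 and sqrt(5).
print(sdh_brute_force([(0, 0, 0), (1, 0, 0), (0, 2, 0)], 1.0, 3))  # [0, 1, 2]
```

Doubling the number of points roughly quadruples the number of pairs, which is the cost an approximate method would aim to avoid.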

Jan, 17

Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator

We examine the Xeon Phi, which is based on Intel’s Many Integrated Core architecture, for its suitability to run the FDK algorithm, the most commonly used algorithm for 3D image reconstruction in cone-beam computed tomography. We study the challenges of efficiently parallelizing the application and means to enable sensible data sharing between threads despite […]

Jan, 17

Power Profiling of GeMTC Many Task Computing

GeMTC allows Many Task Computing (MTC) workloads to run on hardware accelerators and so benefit from their many-core architecture. However, GeMTC is presently written to take advantage of NVIDIA GPUs only. Another such hardware accelerator, the Intel Xeon Phi, is also an excellent candidate for MTC workloads. Therefore, the first goal of […]

CUDA

Jan, 17

GPU Accelerated Vessel Segmentation Using Laplacian Eigenmaps

Laplacian eigenmap is one of the most widely used techniques to improve cluster-based segmentation of multivariate images. However, one problem with this approach is its excessive computational requirements, especially when processing large image datasets. In this paper, we aim to employ emerging commodity graphics hardware for eigenmap-based segmentation. In particular, we present a highly […]

CUDA

Jan, 17

Prefiltered Single Scattering

Volumetric light scattering is a complex phenomenon that is difficult to simulate in real time as light can be scattered towards the camera from everywhere in space. By assuming a single-scattering model, we can transform the usually-employed ray-marching into an efficient ray-independent texture filtering process. Our algorithm builds upon a rectified shadow map as input […]

OpenGL
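
For context, the ray-marching baseline that the abstract above sets out to replace can be sketched as a numerical integral along one view ray. This is not the paper's prefiltering algorithm. The sketch assumes a homogeneous medium, unit light intensity, no phase function, and no attenuation along the light path; the function name and the `visible` callback (standing in for a shadow-map lookup) are hypothetical.

```python
import math


def march_single_scattering(sigma_s, sigma_t, ray_length, visible, steps=1000):
    """Accumulate in-scattered light along one view ray (midpoint rule).

    sigma_s: scattering coefficient, sigma_t: extinction coefficient.
    visible(t): True if the sample at distance t along the ray is lit.
    """
    dt = ray_length / steps
    radiance = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt  # sample at the midpoint of each segment
        if visible(t):
            # Light scattered at t, attenuated on its way to the camera.
            radiance += sigma_s * math.exp(-sigma_t * t) * dt
    return radiance


# Fully lit ray: compare against the closed form of the same integral.
lit = march_single_scattering(0.5, 1.0, 2.0, lambda t: True)
exact = 0.5 / 1.0 * (1.0 - math.exp(-1.0 * 2.0))
print(abs(lit - exact) < 1e-6)  # True
```

The per-ray loop is what makes this approach costly in real time: every pixel repeats the full march, with one visibility lookup per step.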

Jan, 17

Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation

By reorganizing the execution order and optimizing the data structure, we propose an efficient parallel framework for an H.264/AVC encoder based on a massively parallel architecture. We implemented the proposed framework in CUDA on an NVIDIA GPU. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components are also realized effectively, […]

CUDA

Jan, 16

MRPB: Memory Request Prioritization for Massively Parallel Processors

Massively parallel, throughput-oriented systems such as graphics processing units (GPUs) offer high performance for a broad range of programs. They are, however, complex to program, especially because of their intricate memory hierarchies with multiple address spaces. In response, modern GPUs have widely adopted caches, hoping to provide smoother reductions in memory access traffic and latency. […]

CUDA
