high performance computing on graphics processing units: hgpu.org

Posts

Jan, 13

HiDP: A Hierarchical Data Parallel Language

Problem domains are commonly decomposed hierarchically to fully utilize parallel resources in modern microprocessors. Such decompositions can be provided as library routines, written by experienced experts, for general algorithmic patterns. But such APIs tend to be constrained to certain architectures or data sizes. Integrating them with application code is often an unnecessarily daunting task, especially […]

CUDA

Jan, 13

A master-slave robotic simulator based on GPUDirect

As in traditional surgery, surgeons in telerobotic surgery need extensive training to gain experience and achieve highly accurate instrument manipulation. Traditional training methods, such as practice in the operating room, have major drawbacks such as high risk and limited opportunity, for which virtual reality (VR) and computer technologies can offer solutions. To accelerate the data transmission […]

CUDA

Jan, 13

Acceleration of Selective Cationic Antibacterial Peptides computation: A comparison of FPGA and GPU approaches

Prediction of physicochemical properties of peptide sequences can be used for the identification of "Selective Cationic Amphipatic Antibacterial Peptides" (SCAAP), with possible applications in the treatment of different diseases. The exhaustive computation of physicochemical properties of peptide sequences can help reduce the search space of SCAAP, but the combinatorial complexity of these calculations is a high-performance […]

CUDA

Jan, 13

Toward Practical Real-Time Photon Mapping: Efficient GPU Density Estimation

We describe the design space for real-time photon density estimation, the key step of rendering global illumination (GI) via photon mapping. We then detail and analyze efficient GPU implementations of four best-of-breed algorithms. All produce reasonable results on NVIDIA GeForce 670 at 1920×1080 for complex scenes with multiple-bounce diffuse effects, caustics, and glossy reflection in […]

CUDA

Jan, 13

Exploring Traditional and Emerging Parallel Programming Models using a Proxy Application

Parallel computing architectures are becoming more complex with increasing core counts and more heterogeneous architectures. However, the most commonly used programming models, C/C++ with MPI and/or OpenMP, make it very difficult to write source code that is easily tuned for many targets. Newer language approaches attempt to ease this burden by providing optimization features such […]

CUDA

Jan, 12

Accelerating Topic Model Training on a Single Machine

We present the design and implementation of GLDA, a library that utilizes the GPU (Graphics Processing Unit) to perform Gibbs sampling of Latent Dirichlet Allocation (LDA) on a single machine. LDA is an effective topic model used in many applications, e.g., classification, feature selection, and information retrieval. However, training an LDA model on large data […]

CUDA

Jan, 12

Parallel Catmull-Rom Spline Interpolation Algorithm for Image Zooming Based on CUDA

To scale video images in real time, a GPU-aided parallel interpolation algorithm was proposed. The Catmull-Rom spline algorithm for image zooming was reformulated into SIMD (single instruction, multiple data) form according to the CUDA programming model. Re-sampling of each pixel was completed by a GPU thread. Hence, the time-consuming re-sampling procedure of the whole zooming process was handled […]

CUDA

Jan, 12

Implementing Sparse Matrix-Vector Multiplication with QCSR on GPU

Parallel programming is moving computation from single-core to multicore architectures. Graphics Processing Units (GPUs) have recently emerged as outstanding platforms for data parallel applications with regular data access patterns. However, it is still challenging to optimize computations with irregular data access patterns like sparse matrix-vector multiplication (SPMV). SPMV is one […]

CUDA

Jan, 12

Exploring the Feasibility of Fully Homomorphic Encryption

In a major breakthrough, Gentry introduced the first plausible construction of a fully homomorphic encryption (FHE) scheme in 2009. FHE allows the evaluation of arbitrary functions directly on encrypted data on untrusted servers. Later, in 2010, Gentry and Halevi presented the first FHE implementation. However, even for the small setting with 2,048 dimensions, the authors reported a […]

CUDA

Jan, 12

Intrusion Detection Architecture Utilizing Graphics Processors

With advancing technology and the great increase in the use of computer networks, the risk of these networks coming under attack has increased. A number of techniques have been created and designed to help detect and/or prevent such attacks. One common technique is the use of Intrusion Detection Systems (IDS). Today, […]

CUDA

Jan, 12

Lattice Boltzmann simulations of the permeability and capillary adsorption of cement model microstructures

The lattice Boltzmann method is used to investigate the permeability of microstructures of cement pastes generated using the numerical models CEMHYD3D (Bentz, 1997) and mIC (Bishnoi and Scrivener, 2009). Results are reported as a function of paste water-to-cement ratio and degree of hydration. The permeability decreases with increasing hydration and decreasing water-to-cement ratio in agreement […]

CUDA

Jan, 11

Evaluating Reconfigurable Dataflow Computing Using the Himeno Benchmark

Heterogeneous computing using FPGA accelerators is a promising approach to boost the performance of application programs within a given power budget. This paper focuses on optimizations targeting an FPGA-based reconfigurable dataflow computing platform, and shows how they benefit an application. In order to evaluate them, we use the Himeno benchmark, which is a floating point computation kernel […]

CUDA

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)