high performance computing on graphics processing units: hgpu.org

Posts

Sep, 22

A case for neuromorphic ISAs

The desire to create novel computing systems, paired with recent advances in neuroscientific understanding of the brain, has led researchers to develop neuromorphic architectures that emulate the brain. To date, such models are developed, trained, and deployed on the same substrate. However, excessive co-dependence between the substrate and the algorithm prevents portability, or at the […]

CUDA

Sep, 22

Acceleration of the speed of tissue characterization algorithm for coronary plaque by employing GPGPU technique

General-purpose computing on graphics processing units (GPGPU) has recently come into the limelight. The authors have proposed the multiple k-nearest neighbor (MkNN) classifier for the tissue characterization of coronary plaque. Its characterization performance has been rated highly. The purpose of this paper is to accelerate the MkNN classifier, aiming for it […]

CUDA

Sep, 21

Reconstructing hash reversal based proof of work schemes

Proof of work schemes use client puzzles to manage limited resources on a server and provide resilience to denial of service attacks. Attacks utilizing GPUs to inflate computational capacity, known as resource inflation, are a novel and powerful threat that dramatically increases the computational disparity between clients. This disparity renders proof of work schemes based […]

CUDA
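
For readers unfamiliar with hash-reversal client puzzles, the following is a minimal sketch of the general idea only, not the scheme reconstructed in the paper. The server hashes a random preimage and reveals everything except its low bits; the client must brute-force the hidden bits. The function names, the use of SHA-256, and the 64-bit preimage size are all assumptions made for illustration.

```python
import hashlib
import secrets

NONCE_BYTES = 8  # assumed preimage size (64 bits), chosen for illustration


def _digest(value: int) -> str:
    """Hash an integer preimage with SHA-256 (an assumed choice of hash)."""
    return hashlib.sha256(value.to_bytes(NONCE_BYTES, "big")).hexdigest()


def make_puzzle(difficulty_bits: int):
    """Server side: pick a random preimage and hide its low `difficulty_bits` bits."""
    preimage = secrets.randbits(NONCE_BYTES * 8)
    mask = (1 << difficulty_bits) - 1
    hint = preimage & ~mask  # preimage with the hidden bits zeroed out
    return hint, _digest(preimage)


def solve_puzzle(hint: int, digest: str, difficulty_bits: int):
    """Client side: try every value of the hidden bits until the hash matches."""
    for guess in range(1 << difficulty_bits):
        candidate = hint | guess
        if _digest(candidate) == digest:
            return candidate
    return None


def verify(candidate: int, digest: str) -> bool:
    """Server side: checking a solution costs a single hash."""
    return _digest(candidate) == digest
```

Each extra difficulty bit doubles the client's expected work while verification stays at one hash, which is the asymmetry that a client with much greater hashing throughput can exploit.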

Sep, 21

Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming

The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple […]

CUDA
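
The histogramming step named in the abstract above can be sketched sequentially as follows. This is a plain CPU illustration under stated assumptions, not the authors' GPU implementation: it ignores periodic boundary conditions and RDF normalization, and the function name and parameters (`pair_distance_histogram`, `r_max`, `n_bins`) are hypothetical.

```python
import math
from itertools import combinations


def pair_distance_histogram(frames, r_max, n_bins):
    """Count atom-pair distances below r_max, accumulated over all frames.

    `frames` is a list of frames; each frame is a list of (x, y, z) tuples.
    No periodic boundary handling is done (a simplifying assumption).
    """
    bin_width = r_max / n_bins
    counts = [0] * n_bins
    for frame in frames:
        # Every unordered pair of atoms contributes one distance.
        for a, b in combinations(frame, 2):
            r = math.dist(a, b)
            if r < r_max:
                # Clamp to guard against floating-point rounding at the edge.
                counts[min(int(r / bin_width), n_bins - 1)] += 1
    return counts


frames = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]]
histogram = pair_distance_histogram(frames, r_max=3.0, n_bins=3)
```

The inner loop visits every atom pair, so the cost per frame grows quadratically with the number of atoms, which is why this step dominates the analysis time.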

Sep, 21

Scalable, High Performance Fourier Domain Optical Coherence Tomography: Why FPGAs and Not GPGPUs

Fourier Domain Optical Coherence Tomography (FD-OCT) is an emerging biomedical imaging technology featuring ultra-high resolution and fast imaging speed. Due to the complexity of the FD-OCT algorithm, real time FD-OCT imaging demands high performance computing platforms. However, the scaling of real-time FD-OCT processing for increasing data acquisition rates and 3-dimensional (3D) imaging is quickly outpacing […]

CUDA

Sep, 21

Non-deterministic parallelism considered useful

The development of distributed execution engines has greatly simplified parallel programming, by shielding developers from the gory details of programming in a distributed system, and allowing them to focus on writing sequential code [8, 11, 18]. The "sacred cow" in these systems is transparent fault tolerance, which is achieved by dividing the computation into atomic […]

Sep, 21

Parallel graduated assignment algorithm for multiple graph matching based on a common labelling

This paper presents a new parallel algorithm to compute multiple graph matching based on the Graduated Assignment. The aim of developing this parallel algorithm is to perform multiple graph matching on a current desktop computer, but, instead of executing the code on the generic processor, we execute parallel code on the graphics processing unit. Our […]

CUDA

Sep, 21

GPU-based cloud performance for LiDAR data processing

The goal of this paper is to compare the timing and performance results of CPU and GPU on local and cloud platforms for processing massive Light Detection and Ranging (LiDAR) topographic data. Locally, we have used various multi-core CPU technologies as well as GPU implementations on various NVIDIA graphics cards that support CUDA, whereas a cloud […]

CUDA

Sep, 21

Multicore performance optimization using partner cores

As the push for parallelism continues to increase the number of cores on a chip, system design has become incredibly complex; optimizing for performance and power efficiency is now nearly impossible for the application programmer. To assist the programmer, a variety of techniques for optimizing performance and power at runtime have been developed, but many […]

Sep, 21

Mint: realizing CUDA performance in 3D stencil methods with annotated C

We present Mint, a programming model that enables the non-expert to enjoy the performance benefits of hand coded CUDA without becoming entangled in the details. Mint targets stencil methods, which are an important class of scientific applications. We have implemented the Mint programming model with a source-to-source translator that generates optimized CUDA C from traditional […]

CUDA
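
As an illustration of the class of computation the abstract above refers to, here is a minimal sketch of one 7-point 3D stencil update. It is sequential Python, not Mint's annotated C and not generated CUDA; the function name `stencil_step`, the coefficient `alpha`, and the choice to leave boundary cells unchanged are assumptions.

```python
def stencil_step(u, alpha):
    """Apply one 7-point stencil update to the interior of a 3D grid.

    `u` is a nested list indexed as u[i][j][k]. Boundary cells are copied
    unchanged (an assumed boundary treatment). A new grid is returned so
    that every update reads only values from the previous step.
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    new = [[[u[i][j][k] for k in range(nz)] for j in range(ny)] for i in range(nx)]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                # Sum of the six face neighbours of cell (i, j, k).
                neighbours = (
                    u[i - 1][j][k] + u[i + 1][j][k]
                    + u[i][j - 1][k] + u[i][j + 1][k]
                    + u[i][j][k - 1] + u[i][j][k + 1]
                )
                new[i][j][k] = u[i][j][k] + alpha * (neighbours - 6.0 * u[i][j][k])
    return new


# A 3x3x3 grid of zeros with a single hot cell in the centre.
grid = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
grid[1][1][1] = 1.0
result = stencil_step(grid, alpha=0.1)
```

Each interior cell depends only on its own previous value and those of its six neighbours, so all cells can be updated independently, which is what makes stencil methods a natural fit for GPUs.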

Sep, 20

Scalable heterogeneous parallelism for atmospheric modeling and simulation

Heterogeneous multicore chipsets with many levels of parallelism are becoming increasingly common in high-performance computing systems. Effective use of parallelism in these new chipsets constitutes the challenge facing a new generation of large scale scientific computing applications. This study examines methods for improving the performance of two-dimensional and three-dimensional atmospheric constituent transport simulation on the […]

Sep, 20

Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors

MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program’s execution time. Today’s computer systems have tremendous […]

CUDA
