high performance computing on graphics processing units: hgpu.org

Posts

Sep, 15

Scalable Parallel Tridiagonal Algorithms with Diagonal Pivoting and Their Optimization for Many-Core Architectures

Tridiagonal solvers are important building blocks for a wide range of scientific applications that are commonly performance-sensitive. Recently, many-core architectures, such as GPUs, have become ubiquitous targets for these applications. Therefore, a high-performance general-purpose GPU tridiagonal solver becomes critical. However, no existing GPU tridiagonal solver provides comparable quality of solutions to most common, general-purpose CPU […]

CUDA

Sep, 13

Parallel Computation of Non-Bonded Interactions in Drug Discovery: Nvidia GPUs vs. Intel Xeon Phi

Currently, medical research for the discovery of new drugs is increasingly using Virtual Screening (VS) methods. In these methods, the calculation of the non-bonded interactions, such as electrostatic or van der Waals, plays an important role, representing up to 80% of the total execution time. These are computationally intensive operations, and massively parallel in nature, […]

CUDA

Sep, 13

Parallel CYK Membership Test on GPUs

Nowadays general-purpose computing on graphics processing units (GPGPU) performs computations that were formerly handled by the CPU, using hundreds of cores on GPUs. It often improves on the performance of sequential computation when the running program is well structured and formulated for massive threading. The CYK algorithm is a well-known algorithm for the context-free language membership test […]

CUDA

Sep, 13

Analysis of GPU-based convolution for acoustic wave propagation modeling with finite differences: Fortran to CUDA-C step-by-step

By projecting observed microseismic data backward in time to when fracturing occurred, it is possible to locate the fracture events in space, assuming a correct velocity model. In order to achieve this task in near real-time, a robust computational system to handle backward propagation, or Reverse Time Migration (RTM), is required. We can then test […]

CUDA

Sep, 13

Performance and Power Optimization of GPU Architectures for General-purpose Computing

Power-performance efficiency has become a central and challenging focus in heterogeneous processing platforms, as power constraints have to be established without hindering high performance. In this dissertation, a framework for optimizing the power and performance of GPUs in the context of general-purpose computing on GPUs (GPGPU) is proposed. To optimize the leakage […]

CUDA

Sep, 13

HTML5 WebSocket protocol and its application to distributed computing

The HTML5 WebSocket protocol brings real-time communication in web browsers to a new level. Daily, new products are designed to stay permanently connected to the web. WebSocket is the technology enabling this revolution. WebSockets are supported by all current browsers, but they are still a new technology in constant evolution. WebSockets are slowly replacing older […]

OpenCL

Sep, 11

Pattern Matching in OpenCL: GPU vs CPU Energy Consumption on Two Mobile Chipsets

Adaptations of the Aho-Corasick (AC) algorithm on high performance graphics processors (also called GPUs) have garnered increasing attention in recent years. However, no results have been reported regarding their implementations on mobile GPUs. In this paper, we show that implementing a state-of-the-art Aho-Corasick parallel algorithm on a mobile GPU delivers significant speedups. We study a […]

OpenCL

Sep, 11

Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications

GPUs have been proven very effective for structured applications. However, emerging data intensive applications are increasingly unstructured – irregular in their memory and control flow behavior over massive data sets. While the irregularity in these applications can result in poor workload balance among fine-grained threads or coarse-grained blocks, one can still observe dynamically formed pockets […]

CUDA

Sep, 11

Parallelized Seeded Region Growing using CUDA

This paper presents a novel method for parallelizing the seeded region growing (SRG) algorithm using Compute Unified Device Architecture (CUDA) technology, with the intent of overcoming a theoretical weakness of the SRG algorithm: its computation time is directly proportional to the size of a segmented region. The segmentation performance of the proposed CUDA-based SRG is compared […]

CUDA

Sep, 11

Ray Traced Rendering Using GPGPU Devices

Ray tracing is a very popular way to draw 3-D scenes onto a 2-D image. The technique produces a very high degree of visual realism with regard to shadows, reflection, and refraction. The drawback of this technique is the fact that it is extremely computationally expensive. This expense has been a barrier to using ray […]

OpenCL

Sep, 11

Enhancing R with Advanced Compilation Tools and Methods

I describe an approach to compiling common idioms in R code directly to native machine code and illustrate it with several examples. Not only can this yield significant performance gains, but it allows us to use new approaches to computing in R. Importantly, the compilation requires no changes to R itself, but is done entirely […]

CUDA

Sep, 9

Parallel Multi-dimensional Range Query Processing with R-Trees on GPU

General-purpose computing on graphics processing units (GP-GPU) has emerged as a new cost-effective parallel computing paradigm in high performance computing research that enables large amounts of data to be processed in parallel. Large-scale, data-intensive scientific applications have been playing an important role in modern high performance computing research. A common […]

CUDA

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
