high performance computing on graphics processing units: hgpu.org

Posts

Mar, 8

Offloading Region Matching of Data Distribution Management with CUDA

Data distribution management (DDM) aims to reduce the transmission of irrelevant data between High Level Architecture (HLA) compliant simulators by taking their regions of interest into account (i.e. region matching). In a large-scale simulation, computation-intensive region matching has a direct impact on simulation performance. To deal with the high computation cost of region […]

CUDA

Mar, 8

Preliminary implementation of VQ image coding using GPGPU

GPGPU (general-purpose computing on graphics processing units), which uses GPUs for general-purpose computations such as numerical calculations in addition to graphics processing, is attracting a great deal of attention. In this paper, as an example of hierarchical clustering algorithms, we evaluate PNN (pairwise nearest neighbor) on GPUs using CUDA (Compute Unified Device Architecture). We also […]

CUDA

Mar, 8

Frame-based parallelization of MPEG-4 on compute unified device architecture (CUDA)

Due to its object-based nature, flexible features and provision for user interaction, the MPEG-4 encoder is highly suitable for parallelization. The most critical and time-consuming operation of the encoder is motion estimation. Nvidia’s general-purpose graphical processing unit (GPGPU) architecture allows for a massively parallel stream processor model at a very cheap price (in a few thousands […]

CUDA

Mar, 7

IP routing processing with graphic processors

Throughput and programmability have always been the central but generally conflicting concerns of modern IP router designs. Current high-performance routers depend on proprietary hardware solutions, which make it difficult to adapt to ever-changing network protocols. On the other hand, software routers offer the best flexibility and programmability, but can only achieve a throughput one […]

CUDA

Mar, 7

Application-guided tool development for architecturally diverse computation

Architecturally diverse computation exploits non-traditional computing platforms (e.g., field-programmable gate arrays, graphics processors, heterogeneous chip multiprocessors) to execute user applications. We have designed the Auto-Pipe tool set with the goal of easing the task of developing applications for architecturally diverse systems. Prior to and during the course of Auto-Pipe’s design, we have developed a number […]

CUDA

Mar, 7

Non-blocking programming on multi-core graphics processors: (extended abstract)

This paper investigates the synchronization power of coalesced memory accesses, a family of memory access mechanisms introduced in recent large multicore architectures like the CUDA graphics processors. We first design three memory access models to capture the fundamental features of the new memory access mechanisms. Subsequently, we prove the exact synchronization power of these models […]

CUDA

Mar, 7

CUDA-based AES parallelization with fine-tuned GPU memory utilization

Current graphics processing units (GPUs) offer great potential for speeding up computationally intensive data-parallel applications over traditional parallelization approaches, since GPUs contain many more hardware threads than the computational cores available to common CPU threads. NVIDIA developed a generic GPU programming platform, CUDA, which allows programmers to utilize the GPU through C programming […]

CUDA

Mar, 7

Designing scalable many-core parallel algorithms for min graphs using CUDA

Removing redundant edges on a large graph is a fundamental problem in many practical applications such as verification of real-time systems and network routing. In this paper, we present the designs of scalable and efficient parallel algorithms for multiple many-core GPU devices using CUDA. Our algorithms expose substantial fine-grained parallelism while maintaining minimal global communication. […]

CUDA

Mar, 7

A tile-based parallel Viterbi algorithm for biological sequence alignment on GPU with CUDA

The Viterbi algorithm is the compute-intensive kernel in Hidden Markov Model (HMM) based sequence alignment applications. In this paper, we investigate extending several parallel methods, such as the wave-front and streaming methods for the Smith-Waterman algorithm, to achieve a significant speed-up on a GPU. The wave-front method can take advantage of the computing power of […]

CUDA

Mar, 7

Efficient parallel algorithms for maximum-density segment problem

One of the fundamental problems involving DNA sequences is to find high density segments of certain widths, for example, those regions with intensive guanine and cytosine (GC). Formally, given a sequence, each element of which has a value and a width, the maximum-density segment problem asks for the segment with the maximum density while satisfying […]

CUDA

Mar, 7

Fast implementation of Wyner-Ziv Video codec using GPGPU

In this paper, we report a fast implementation of a Wyner-Ziv video decoder using general-purpose computing on graphics processing units (GPGPU). Despite its many advantages, Wyner-Ziv video coding has the problem of huge decoding complexity. Since Slepian-Wolf decoding with rate-adaptive LDPC accumulate code takes up more than 90% of the entire Wyner-Ziv video decoding complexity, […]

CUDA

Mar, 7

Object-oriented stream programming using aspects

High-performance parallel programs that efficiently utilize heterogeneous CPU+GPU accelerator systems require tuned coordination among multiple program units. However, using current programming frameworks such as CUDA leads to tangled source code that combines code for the core computation with that for device and computational kernel management, data transfers between memory spaces, and various optimizations. In this […]

CUDA
