high performance computing on graphics processing units: hgpu.org

Posts

Oct, 31

Low Latency Complex Event Processing on Parallel Hardware

Several application domains involve observing events, processing them, and reacting. This asks for a Complex Event Processing (CEP) engine in charge of interpreting, filtering, and combining primitive events that occur in the external environment, to identify higher level composite events, according to a set of rules written in an ad-hoc rule definition language. A key […]

CUDA

Oct, 31

Fast Speaker Diarization Using a High-Level Scripting Language

Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine "who spoke when" in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to roughly real-time. Increased sizes of current datasets require […]

CUDA

Oct, 31

Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation

Multi-core parallelism and accelerators are becoming common features of today’s computer systems, as they allow for computational power without sacrificing energy efficiency. Due to heterogeneity, tuning for each type of compute unit and adequate load balancing is essential. This paper proposes static and dynamic solutions for load balancing in the context of an application for […]

CUDA
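
As a rough illustration of what static load balancing can mean on a heterogeneous system, the sketch below splits a fixed number of work items across compute units in proportion to their relative throughput. This is not the paper's method: the function name `static_partition` and the integer throughput weights are assumptions made for this example only.

```python
def static_partition(num_items, throughputs):
    """Split num_items across devices in proportion to integer throughput weights.

    Hypothetical helper: uses the largest-remainder rule so that the
    per-device counts always sum to exactly num_items.
    """
    total = sum(throughputs)
    # Integer share for each device, rounded down.
    counts = [num_items * t // total for t in throughputs]
    remainders = [num_items * t % total for t in throughputs]
    leftover = num_items - sum(counts)
    # Hand the remaining items to the devices with the largest remainders.
    by_remainder = sorted(range(len(throughputs)),
                          key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Example: a CPU with weight 1 and a GPU assumed to be three times faster.
    print(static_partition(100, [1, 3]))
```

A static split like this is fixed before execution starts, so it only balances well when the throughput estimates are accurate; a dynamic scheme would instead hand out small chunks at run time as each unit becomes free.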

Oct, 31

Environment Segmentation in Service Robotics

In the field of robotics a common problem is attempting to understand the world or environment in which the robot is operating. This is a common issue, as robots do not have an "intuitive" sense of their environment. Environment segmentation is a technique that is used to allow for the isolation of different parts of […]

CUDA

Oct, 31

High-performance software rasterization on GPUs

In this paper, we implement an efficient, completely software-based graphics pipeline on a GPU. Unlike previous approaches, we obey ordering constraints imposed by current graphics APIs, guarantee hole-free rasterization, and support multisample antialiasing. Our goal is to examine the performance implications of not exploiting the fixed-function graphics pipeline, and to discern which additional hardware support […]

CUDA

Oct, 31

Parallel implementation of flow and matching algorithms

In our work we present two parallel algorithms and their lock-free implementations using the popular GPU environment NVIDIA CUDA. The first algorithm is the push-relabel method for the flow problem in grid graphs. The second is the cost scaling algorithm for the assignment problem in complete bipartite graphs.

CUDA
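
For readers unfamiliar with the push-relabel method named in the abstract above, here is a minimal sequential sketch of the generic algorithm on a small dense graph. It is not the paper's lock-free GPU implementation and is not specialized to grid graphs; the function name `push_relabel_max_flow` and the capacity-matrix representation are assumptions chosen for brevity.

```python
def push_relabel_max_flow(capacity, source, sink):
    """Generic sequential push-relabel on a dense capacity matrix (illustrative only)."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    height = [0] * n
    excess = [0] * n

    # Initialization: lift the source and saturate all of its outgoing edges.
    height[source] = n
    for v in range(n):
        c = capacity[source][v]
        if c > 0:
            flow[source][v] = c
            flow[v][source] = -c
            excess[v] += c
            excess[source] -= c

    def find_active():
        # An active vertex is any non-terminal vertex holding excess flow.
        for v in range(n):
            if v != source and v != sink and excess[v] > 0:
                return v
        return None

    u = find_active()
    while u is not None:
        pushed = False
        for v in range(n):
            if excess[u] == 0:
                break
            residual = capacity[u][v] - flow[u][v]
            # Push only along admissible edges: residual > 0, one level downhill.
            if residual > 0 and height[u] == height[v] + 1:
                delta = min(excess[u], residual)
                flow[u][v] += delta
                flow[v][u] -= delta
                excess[u] -= delta
                excess[v] += delta
                pushed = True
        if not pushed:
            # Relabel: raise u just above its lowest residual neighbour.
            height[u] = 1 + min(height[v] for v in range(n)
                                if capacity[u][v] - flow[u][v] > 0)
        u = find_active()

    return excess[sink]


if __name__ == "__main__":
    # Vertices: 0 = source, 1 and 2 = intermediate, 3 = sink.
    cap = [[0, 3, 2, 0],
           [0, 0, 1, 2],
           [0, 0, 0, 3],
           [0, 0, 0, 0]]
    print(push_relabel_max_flow(cap, 0, 3))
```

Because each push and relabel touches only a vertex and its neighbours, the method is a natural candidate for parallel execution, which is the property the GPU work builds on.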

Oct, 30

Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU architectures

Recently, general purpose GPU (GPGPU) programming has spread rapidly after CUDA was first introduced to write parallel programs in high-level languages for NVIDIA GPUs. While a GPU exploits data parallelism very effectively, task-level parallelism is exploited as a multi-threaded program on a multicore CPU. For such a heterogeneous platform that consists of a multicore CPU […]

CUDA

Oct, 30

Accelerating Real-time processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures

The real-time processing of the four-meter Advanced Technology Solar Telescope (ATST) adaptive optics (AO) system with approximately 1750 sub-apertures and 1900 actuators requires massive parallel processing to complete the task. The parallel processing is harnessed with the addition of hardware accelerators such as Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). We […]

CUDA

Oct, 30

A scalable hybrid algorithm based on domain decomposition and algebraic multigrid for solving partial differential equations on a cluster of CPU/GPUs

Several of the top ranked supercomputers are based on the hybrid architecture consisting of a large number of CPUs and GPUs. Very high performance has been obtained for problems with special structures, such as FFT-based image processing or N-body based particle calculations. However, for the class of problems described by partial differential equations discretized by […]

CUDA

Oct, 30

Efficient Implementation of the eta_T Pairing on GPU

Recently, efficient implementation of cryptographic algorithms on graphics processing units (GPUs) has attracted a lot of attention in the cryptologic research community. In this paper, we deal with efficient implementation of the $\eta_T$ pairing on supersingular curves over finite fields of characteristic 3. We report the performance results of implementations on NVIDIA GTX 285, GTX […]

Oct, 30

Optimizing and Auto-tuning Belief Propagation on the GPU

A CUDA kernel will utilize high-latency local memory for storage when there are not enough registers to hold the required data or if the data is an array that is accessed using a variable index within a loop. However, accesses from local memory take longer than accesses from registers and shared memory, so it is […]

CUDA

Oct, 30

Effective Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs

In this work we discuss the benefits of using massively parallel architectures for the optimization of Virtual Screening methods. We empirically demonstrate that GPUs are a well-suited architecture for the acceleration of non-bonded interaction kernels, obtaining up to a 260 times sustained speedup compared to the sequential counterpart.

CUDA
