high performance computing on graphics processing units: hgpu.org

Posts

Jan, 10

Dense optical flow by iterative local window registration

We study dense optical flow estimation using iterative registration of local windows, also known as iterative Lucas-Kanade (LK) [B. Lucas et al, 1981]. We show that the usual iterative-warping scheme encounters divergence problems and propose a modified scheme with better behavior. It yields good results with a much lower cost than the exact dense LK […]

CUDA

Jan, 10

CUDA-Based Radiative Transfer Method with Application to the EM Scattering from a Two-Layer Canopy Model

Efforts to determine the scattering contributions of a large number of samples in the vegetation canopy impose a heavy computational burden, which clearly hampers the use of the traditional serial algorithm, based on radiative transfer theory, to evaluate electromagnetic (EM) scattering from vegetation. Nevertheless, the Compute […]

CUDA

Jan, 10

Connected component labeling on a 2D grid using CUDA

Connected component labeling is an important but computationally expensive operation required in many fields of research. The goal in the present work is to label connected components on a 2D binary map. Two different iterative algorithms for doing this task are presented. The first algorithm (Row-Col Unify) is based upon the directional propagation labeling, whereas […]

CUDA
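
The abstract above names two iterative labeling algorithms but is cut off before describing them. As a rough illustration of what iterative labeling on a 2D binary map can look like, here is a minimal sequential sketch in Python. It is not the paper's Row-Col Unify algorithm: it is a generic minimum-label propagation over 4-connected neighbors, and the function name `label_components` and the choice of 4-connectivity are assumptions made for this example.

```python
def label_components(grid):
    """Label 4-connected foreground regions of a 2D binary grid.

    Generic minimum-label propagation (an assumed stand-in, not the
    paper's method): every foreground cell starts with a unique label
    and repeatedly adopts the smallest label among its foreground
    neighbors until a full pass changes nothing.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    # Unique positive label per foreground cell; 0 marks background.
    labels = [
        [r * cols + c + 1 if grid[r][c] else 0 for c in range(cols)]
        for r in range(rows)
    ]
    changed = True
    while changed:
        changed = False
        for r in range(rows):
            for c in range(cols):
                if labels[r][c] == 0:
                    continue
                best = labels[r][c]
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc]:
                        best = min(best, labels[nr][nc])
                if best < labels[r][c]:
                    labels[r][c] = best
                    changed = True
    return labels


if __name__ == "__main__":
    example = [
        [1, 1, 0],
        [0, 0, 0],
        [0, 1, 1],
    ]
    for row in label_components(example):
        print(row)
```

On a GPU, each cell would typically be handled by its own thread, with the outer loop repeated until no thread reports a change; the sequential loops here only stand in for that per-cell work.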

Jan, 9

Tapping the supercomputer under your desk: solving dynamic equilibrium models with graphics processors?

This paper shows how to build algorithms that use graphics processing units (GPUs) installed in most modern computers to solve dynamic equilibrium models in economics. In particular, we rely on the compute unified device architecture (CUDA) of NVIDIA GPUs. We illustrate the power of the approach by solving a simple real business cycle model with […]

CUDA

Jan, 9

Parallel Prefix Sum (Scan) with CUDA

Parallel prefix sum, also known as parallel Scan, is a useful building block for many parallel algorithms including sorting and building data structures. In this document we introduce Scan and describe step-by-step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naive algorithm and proceed through more advanced techniques to […]

CUDA
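
The abstract above mentions starting from a basic naive scan. A common naive formulation, assumed here because the abstract is truncated before giving details, is the Hillis-Steele style scan, in which each pass adds the element `offset` positions to the left and `offset` doubles every pass. The sketch below simulates those passes sequentially in Python; the function names are illustrative and do not come from the document.

```python
def naive_inclusive_scan(values):
    """Inclusive prefix sum using doubling offsets (Hillis-Steele style).

    Each pass builds a new list from the previous one, which mirrors the
    double buffering a parallel version needs so that no element is read
    after it has already been overwritten in the same pass.
    """
    data = list(values)
    n = len(data)
    offset = 1
    while offset < n:
        data = [
            data[i] + data[i - offset] if i >= offset else data[i]
            for i in range(n)
        ]
        offset *= 2
    return data


def exclusive_from_inclusive(inclusive):
    """Shift an inclusive scan right by one and insert the identity (0)."""
    return [0] + inclusive[:-1] if inclusive else []


if __name__ == "__main__":
    sample = [3, 1, 7, 0, 4, 1, 6, 3]
    inclusive = naive_inclusive_scan(sample)
    print(inclusive)
    print(exclusive_from_inclusive(inclusive))
```

This version performs on the order of n log n additions, compared with n - 1 for a sequential loop, which is the usual motivation for moving on to work-efficient variants.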

Jan, 9

A study on tetrahedron-based inhomogeneous Monte Carlo optical simulation

Monte Carlo (MC) simulation is widely recognized as a gold standard in biophotonics for its high accuracy. Here we analyze several issues associated with tetrahedron-based optical Monte Carlo simulation in the context of TIM-OS, MMCM, MCML, and CUDAMCML in terms of accuracy and efficiency. Our results show that TIM-OS has significantly better performance in the […]

CUDA

Jan, 9

Real-time object detection on CUDA

The aim of the research described in this article is to accelerate object detection in images and video sequences using graphics processors. It includes algorithmic modifications and adjustments of existing detectors, the construction of efficient implementation variants, and an evaluation against efficient CPU implementations. This article focuses on detection by statistical classifiers based on […]

CUDA

Jan, 9

Evaluation and tuning of the Level 3 CUBLAS for graphics processors

The increase in performance of the last generations of graphics processors (GPUs) has made this class of platform a coprocessing tool with remarkable success in certain types of operations. In this paper we evaluate the performance of the Level 3 operations in CUBLAS, the implementation of BLAS for NVIDIA GPUs with unified architecture. From this […]

CUDA

Jan, 9

Parallel programming for multimedia applications

Computing capabilities are continuing to increase with the availability of multi core and many core processors. The wide availability of multi core processors has made parallel programming possible for end user applications running on desktops, workstations, and mobile devices. While parallel hardware has become common, software that exploits parallel capabilities is just beginning to take […]

CUDA

Jan, 9

A new approach to the lattice Boltzmann method for graphics processing units

Emerging many-core processors, like CUDA capable nVidia GPUs, are promising platforms for regular parallel algorithms such as the Lattice Boltzmann Method (LBM). Since the global memory for graphic devices shows high latency and LBM is data intensive, the memory access pattern is an important issue for achieving good performances. Whenever possible, global memory loads and […]

CUDA

Jan, 9

High-throughput bayesian computing machine with reconfigurable hardware

We use reconfigurable hardware to construct a high throughput Bayesian computing machine (BCM) capable of evaluating probabilistic networks with arbitrary DAG (directed acyclic graph) topology. Our BCM achieves high throughput by exploiting the FPGA’s distributed memories and abundant hardware structures (such as long carry-chains and registers), which enables us to 1) develop an innovative memory […]

CUDA

Jan, 9

Improving the Performance of Hyperspectral Image and Signal Processing Algorithms Using Parallel, Distributed and Specialized Hardware-Based Systems

Advances in sensor technology are revolutionizing the way remotely sensed data is collected, managed and analyzed. The incorporation of latest-generation sensors to airborne and satellite platforms is currently producing a nearly continual stream of high-dimensional data, and this explosion in the amount of collected information has rapidly created new processing challenges. For instance, hyperspectral signal […]

CUDA

Recent source codes

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

LC Framework

pplx-garden: Perplexity open source garden for inference technology

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

OpScanner

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer
