high performance computing on graphics processing units: hgpu.org

Posts

Mar, 13

GPU Objects

Points, lines, and polygons have been the fundamental primitives in graphics. Graphics hardware is optimized to handle them in a pipeline. Other objects are converted to these primitives before rendering. Programmable GPUs have made it possible to introduce a wide class of computations on each vertex and on each fragment. In this paper, we outline […]

OpenGL

Mar, 12

Improving the Efficiency of GPU Clusters

If you perceive more than a little excitement around the topic of Graphics Processing Units (GPUs) in High-Performance Computing (HPC), it’s for pretty good reason. HPC is all about performance after all, and it’s not every day that a new technology promises an order of magnitude boost in processing power. A variety of new GPU […]

Mar, 12

Efficient Spatial Binning on the GPU

We present a new technique for sorting data into spatial bins or buckets using a graphics processing unit (GPU). Our method takes unsorted point data as input and scatters the points, in sorted order, into a set of bins. This is a key operation in the construction of spatial data structures, which are essential for […]
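
As a rough illustration of the operation the excerpt describes (unsorted points in, points grouped by bin out), here is a minimal sequential sketch based on a counting sort. It is not the paper's method: the function name `bin_points`, the uniform 2D grid, and the parameters `cell_size` and `grid_dim` are assumptions made for this example, and points are assumed to have non-negative coordinates. On a GPU, the counting and scatter passes would typically be done in parallel, for example with atomic counters and a parallel prefix sum.

```python
from itertools import accumulate


def bin_points(points, cell_size, grid_dim):
    """Group 2D points by grid cell using a counting sort (hypothetical helper).

    Returns (sorted_points, starts, counts), where bin b occupies
    sorted_points[starts[b] : starts[b] + counts[b]].
    """
    def bin_of(p):
        # Map a point to a flat bin index; clamp to the last cell on each axis.
        x, y = p
        ix = min(int(x // cell_size), grid_dim - 1)
        iy = min(int(y // cell_size), grid_dim - 1)
        return iy * grid_dim + ix

    n_bins = grid_dim * grid_dim

    # Pass 1: histogram of points per bin.
    counts = [0] * n_bins
    for p in points:
        counts[bin_of(p)] += 1

    # Exclusive prefix sum gives the first output slot of each bin.
    starts = [0] + list(accumulate(counts))[:-1]

    # Pass 2: scatter each point into the next free slot of its bin.
    cursor = list(starts)
    sorted_points = [None] * len(points)
    for p in points:
        b = bin_of(p)
        sorted_points[cursor[b]] = p
        cursor[b] += 1

    return sorted_points, starts, counts


if __name__ == "__main__":
    pts = [(3.5, 0.2), (0.1, 0.1), (1.2, 1.9), (0.9, 0.4)]
    ordered, starts, counts = bin_points(pts, cell_size=1.0, grid_dim=4)
    print(ordered)
```

The prefix sum is what lets every point be written directly to its final position, which is why this two-pass structure maps naturally onto scatter-style GPU kernels.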

Mar, 12

GPU Octrees and Optimized Search

Octree structures are widely used in graphics applications to accelerate the computation of geometric proximity relations. This data structure is fundamental to game engine architectures for correct scene management and culling. With the increasing power of graphics hardware, processing tasks are progressively being ported to those architectures. However, octrees are essentially hierarchical structures, […]

CUDA

Mar, 12

Acceleration of a CFD Code with a GPU

The CFD code Overflow includes as one of its solver options a quasi-SSOR algorithm. This is a fairly small piece of code but it accounts for a significant portion of the total computational time. This paper studies some of the issues in accelerating the code by use of a GPU. The algorithm needs to be […]

CUDA

Mar, 12

Empowering Visual Categorization With the GPU

Visual categorization is important to manage large collections of digital images and video, where textual metadata is often incomplete or simply unavailable. The bag-of-words model has become the most powerful method for visual categorization of images and video. Despite its high accuracy, a severe drawback of this model is its high computational cost. As the […]

CUDA

Mar, 12

GPU Computing with Orientation Maps for Extracting Local Invariant Features

Local invariant features have been widely used as fundamental elements for image matching and object recognition. Although dense sampling of local features is useful in achieving an improved performance in image matching and object recognition, it results in increased computational costs for feature extraction. The purpose of this paper is to develop fast computational techniques […]

CUDA

Mar, 12

GEMM on a GPU

Matrix-matrix multiplication is the most important operation in high-performance linear algebra. If your application can cast most of its computation in terms of the level-3 BLAS operations, the application can achieve very high performance levels. For this reason, the Basic Linear Algebra Subprograms (BLAS) tend to heavily optimize this operation. With Graphics Processing Units (GPUs) on the […]

CUDA

Mar, 12

To GPU Synchronize or Not GPU Synchronize?

The graphics processing unit (GPU) has evolved from being a fixed-function processor with programmable stages into a programmable processor with many fixed-function components that deliver massive parallelism. By modifying the GPU’s stream processor to support “general-purpose computation” on the GPU (GPGPU), applications that perform massive vector operations can realize many orders-of-magnitude improvement in performance over […]

CUDA

Mar, 12

GBOOST: A GPU-based tool for detecting gene-gene interactions in genome-wide case control studies

MOTIVATION: Collecting millions of genetic variations is feasible with the advanced genotyping technology. With a huge amount of genetic variations data in hand, developing efficient algorithms to carry out the gene-gene interaction analysis in a timely manner has become one of the key problems in Genome-Wide Association Studies (GWAS). Boolean operation based screening and testing […]

Mar, 12

Fast Optimal Mass Transport for Dynamic Active Contour Tracking on the GPU

In computational vision, visual tracking remains one of the most challenging problems due to noise, clutter, occlusion, and dynamic scenes. No one technique has yet managed to solve this problem completely, but those that employ control-theoretic filtering techniques have proven to be quite successful. In this work, we extend one such technique by Niethammer […]

Mar, 11

Stereo depth with a Unified Architecture GPU

This paper describes how the calculation of depth from stereo images was accelerated using a GPU. The Compute Unified Device Architecture (CUDA) from NVIDIA was employed in novel ways to compute depth using BT cost matching and the semi-global matching algorithm. The challenges of mapping a sequential algorithm to a massively parallel thread environment and […]

CUDA
