high performance computing on graphics processing units: hgpu.org

Posts

Nov, 2

Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead

Sorting is a well-investigated topic in Computer Science in general and by now many efficient sorting algorithms for CPUs and GPUs have been developed. There is no swapping, paging, etc. available on GPUs to provide more virtual memory than physically available, thus if one wants to sort sequences that exceed GPU memory using the GPU […]

CUDA

Nov, 2

Linear algebra operators for GPU implementation of numerical algorithms

In this work, the emphasis is on the development of strategies to realize techniques of numerical computing on the graphics chip. In particular, the focus is on the acceleration of techniques for solving sets of algebraic equations as they occur in numerical simulation. We introduce a framework for the implementation of linear algebra operators on […]

OpenGL

Nov, 2

Improving Performance of Matrix Multiplication and FFT on GPU

In this paper we discuss our experiences in improving the performance of two key algorithms: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and single-precision FFT using CUDA. The former is computation-intensive, while the latter is memory bandwidth or communication-intensive. A peak performance of 393 Gflops is achieved on NVIDIA GeForce GTX280 for the […]

CUDA

Nov, 2

Acceleration of finite-difference time-domain (FDTD) using graphics processor units (GPU)

The Finite-Difference Time-Domain (FDTD) method is used extensively in areas of microwave engineering and optics. However, FDTD runs too slowly for some simulations to be practical, especially when run on standard desktop computers. The suitability of dedicated hardware for the acceleration of FDTD computations has been investigated. It is demonstrated that standard consumer Graphics Processor […]

OpenGL

Nov, 1

A control-structure splitting optimization for GPGPU

Control statements in a GPU program such as loops and branches pose serious challenges for the efficient usage of GPU resources because those control statements will lead to the serialization of threads and consequently ruin the occupancy of GPU, that is, the number of threads running concurrently. Unlike traditional vector processing units that are inside […]

CUDA

Nov, 1

GPU-assisted decoding of video samples represented in the YCoCg-R color space

Although pixel shaders were designed for the creation of programmable rendering effects, they can also be used as generic processing units for vector data. In this paper, attention is paid to an implementation of the YCoCg-R to RGB color space transform, as defined in the H.264/AVC Fidelity Range Extensions, by making use of pixel shaders. […]
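
For orientation, YCoCg-R is a reversible integer color transform built from lifting steps. The following is a minimal CPU-side Python sketch of the forward and inverse steps as they are commonly stated; it is not the paper's pixel-shader implementation, and the function names are hypothetical.

```python
def rgb_to_ycocg_r(r, g, b):
    """Forward YCoCg-R lifting steps on integer samples (hypothetical helper)."""
    co = r - b
    t = b + (co >> 1)   # >> is an arithmetic (floor) shift on Python ints
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg


def ycocg_r_to_rgb(y, co, cg):
    """Inverse YCoCg-R: undo the lifting steps in reverse order."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b
```

Because each lifting step is undone exactly, the round trip is lossless for integer inputs; note that Co and Cg need one more bit of range than the RGB samples.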

Nov, 1

GPGPU: general purpose computation on graphics hardware

The graphics processor (GPU) on today’s commodity video cards has evolved into an extremely powerful and flexible processor. The latest graphics architectures provide tremendous memory bandwidth and computational horsepower, with fully programmable vertex and pixel processing units that support vector operations up to full IEEE floating point precision. High level languages have emerged for graphics […]

Nov, 1

A GPU accelerated storage system

Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, peak performance one order of magnitude higher than traditional CPUs. This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redesign […]

CUDA

Nov, 1

OpenVIDIA: parallel GPU computer vision

Graphics and vision are approximate inverses of each other: ordinarily Graphics Processing Units (GPUs) are used to convert "numbers into pictures" (i.e. computer graphics). In this paper, we propose using GPUs in approximately the reverse way: to assist in "converting pictures into numbers" (i.e. computer vision). The OpenVIDIA project uses single or multiple graphics cards […]

OpenGL

Nov, 1

Real-time particle systems on the GPU in dynamic environments

Abstract unavailable

Nov, 1

GPU-ClustalW: Using Graphics Hardware to Accelerate Multiple Sequence Alignment

Molecular biologists frequently compute multiple sequence alignments (MSAs) to identify similar regions in protein families. However, aligning hundreds of sequences by popular MSA tools such as ClustalW requires several hours on sequential computers. Due to the rapid growth of biological sequence databases, biologists have to compute MSAs in a far shorter time. In this paper […]

Nov, 1

GPU Simulation and Rendering of Volumetric Effects for Computer Games and Virtual Environments

As simulation and rendering capabilities continue to increase, volumetric effects like smoke, fire or explosions will be frequently encountered in computer games and virtual environments. In this paper, we present techniques for the visual simulation and rendering of such effects that keep up with the demands for frame rates imposed by such environments. This […]

OpenGL

* * *

Recent source codes

XaaS containers

microSYCL: SYCL micro-benchmarks repository

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

Most viewed papers (last 30 days)