high performance computing on graphics processing units: hgpu.org

Posts

Dec, 16

Efficient XML Path Filtering Using GPUs

Publish-subscribe (pub-sub) systems present the state of the art in information dissemination to multiple users. Current XML-based pub-sub systems provide users with considerable flexibility, allowing the formulation of complex queries on the content as well as the structure of the streaming messages. Messages that contain one or more matches for a given user profile (query) […]

CUDA

Dec, 16

A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers

We examine the problem of solving many thousands of small dense linear algebra factorizations simultaneously on Graphics Processing Units (GPUs). We are interested in problems ranging from several hundred rows and columns down to 4×4 matrices. Problems of this size are common, especially in signal processing. However, they have received very little attention from current […]

CUDA

Dec, 16

Improving GPU Robustness by Making Use of Faulty Parts

With hundreds of processing units in current state-of-the-art graphics processing units (GPUs), the probability that one or more processing units fail due to permanent faults, during fabrication or post deployment, increases drastically. In our experiments we found that the loss of a single streaming multiprocessor (SM) in an 8-SM GPU resulted in as much as […]

Dec, 16

Optimizing for a Many-Core Architecture without Compromising Ease-of-Programming

Faced with nearly stagnant clock speed advances, chip manufacturers have turned to parallelism as the source for continuing performance improvements. But even though numerous parallel architectures have already been brought to market, a universally accepted methodology for programming them for general purpose applications has yet to emerge. Existing solutions tend to be hardware-specific, rendering them […]

CUDA

Dec, 16

Implementation and Evaluation of Scientific Simulations on High Performance Computing Architectures

Computational Science is a field of study in which computers are used to solve challenging scientific problems. Real or imaginary world scientific problems are converted into mathematical models and solved using numerical analysis techniques with the help of high performance computing, commonly called scientific computing. As computer technology is advancing rapidly, computers are becoming increasingly powerful […]

CUDA

Dec, 16

Affine Vector Cache for memory bandwidth savings

Preserving memory locality is a major issue in highly-multithreaded architectures such as GPUs. These architectures hide latency by maintaining a large number of threads in flight. As each thread needs to maintain a private working set, all threads collectively put tremendous pressure on on-chip memory arrays, at significant cost in area and power. We show […]

CUDA

Dec, 15

Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

Single-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into so-called warps to amortize the cost of instruction fetch, decode and control logic over multiple execution units. As individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of […]

CUDA

Dec, 15

Memory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model

GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers […]

CUDA

Dec, 15

A GPU-based Approximate SVD Algorithm

Approximation of matrices using the Singular Value Decomposition (SVD) plays a central role in many science and engineering applications. However, the computation cost of an exact SVD is prohibitively high for very large matrices. In this paper, we describe a GPU-based approximate SVD algorithm for large matrices. Our method is based on the QUIC-SVD introduced […]

CUDA

Dec, 15

GPU Algorithms for the Estimation of Environmental Models Based on Large Datasets

Statistical environmental models are computationally intensive due to the high dimension of the data, both in space and time, and due to the inferential techniques required for parameter estimation and spatial prediction. In particular, the complexity of these procedures is related to matrix operations (inversion, solution of linear systems, factorization) involving large matrices. Recently, much […]

CUDA

Dec, 15

GPU Collision Detection in Conformal Geometric Space

We derive a conformal algebra treatment unifying all types of collisions among points, vectors, areas (defined by bivectors and trivectors) and 3D solid objects (defined by trivectors and quadvectors), based on a reformulation of collision queries from R^3 to conformal R^{4,1} space. The algebraic formulation in this 5D space is then implemented in GPU to […]

CUDA

Dec, 15

Performance in GPU Architectures: Potentials and Distances

GPUs can execute up to one TFLOPs at their peak performance. This peak performance, however, is rarely reached as a result of resource underutilization. Three parameters contribute to this inefficiency: branch divergence, memory access delays and limited workload parallelism. To this end we suggest machine models to estimate performance gain potentials obtainable by eliminating each […]

CUDA

* * *

Recent source codes

XaaS containers

microSYCL: SYCL micro-benchmarks repository

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management
