high performance computing on graphics processing units: hgpu.org

Posts

Jan, 21

Reproducible and Accurate Matrix Multiplication for GPU Accelerators

Due to the non-associativity of floating-point operations and dynamic scheduling on parallel architectures, getting a bitwise reproducible floating-point result for multiple executions of the same code on different or even similar parallel architectures is challenging. In this paper, we address the problem of reproducibility in the context of matrix multiplication and propose an algorithm that yields […]

OpenCL
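
The non-associativity mentioned in the entry above can be seen in a few lines of Python. This is only a minimal sketch of the underlying problem, not the algorithm proposed in the paper: the helper `sum_in_order` and the sample values are hypothetical and chosen for illustration, and `math.fsum` stands in as one order-independent (correctly rounded) summation from the standard library.

```python
import math

def sum_in_order(values, order):
    """Accumulate values left to right in the given index order (hypothetical helper)."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# Illustrative values: the small term is absorbed when added to 1e16 first.
values = [1e16, 1.0, -1e16]

# Two accumulation orders, as a dynamic scheduler might produce on different runs.
run_a = sum_in_order(values, [0, 1, 2])  # (1e16 + 1.0) + (-1e16)
run_b = sum_in_order(values, [0, 2, 1])  # (1e16 + (-1e16)) + 1.0

# math.fsum tracks exact partial sums, so its result does not depend on order.
ref_a = math.fsum(values[i] for i in [0, 1, 2])
ref_b = math.fsum(values[i] for i in [0, 2, 1])

print("naive sums agree:", run_a == run_b)
print("fsum results agree:", ref_a == ref_b)
```

The naive loop gives different results for the two orders, while the exact-accumulation variant does not, which is the kind of property a reproducible matrix multiplication needs for each dot product.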

Jan, 21

GPU concurrency: Weak behaviours and programming assumptions

Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software. To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent […]

OpenCL

Jan, 21

A Novel Implementation of QuickHull Algorithm on the GPU

We present a novel GPU-accelerated implementation of the QuickHull algorithm for calculating convex hulls of planar point sets. We also describe a practical solution to demonstrate how to efficiently implement a typical divide-and-conquer algorithm on the GPU. We make heavy use of the parallel primitives provided by the Thrust library, such as the parallel segmented scan for […]

CUDA
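
For readers unfamiliar with QuickHull, the divide-and-conquer structure referred to in the entry above can be sketched serially in Python. This is a plain recursive CPU version written for illustration, not the authors' GPU implementation and not based on Thrust; the function names `cross`, `hull_side`, and `quickhull` are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive if b lies left of the line o -> a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hull_side(points, a, b):
    """Hull vertices strictly left of the directed line a -> b, ordered from a to b."""
    left = [p for p in points if cross(a, b, p) > 0]
    if not left:
        return []
    # The point farthest from the line a -> b is guaranteed to be on the hull.
    far = max(left, key=lambda p: cross(a, b, p))
    # Divide: recurse on the two sub-problems outside the triangle (a, far, b).
    return hull_side(left, a, far) + [far] + hull_side(left, far, b)

def quickhull(points):
    """Convex hull of 2D points in clockwise order, starting at the leftmost point."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    a, b = pts[0], pts[-1]  # leftmost and rightmost points are hull vertices
    return [a] + hull_side(pts, a, b) + [b] + hull_side(pts, b, a)

square_with_interior = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
hull = quickhull(square_with_interior)
print(hull)
```

Each recursive call works on an independent subset of points, which is the property a segmented primitive can exploit by processing all subsets in one data-parallel pass.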

Jan, 21

Opportunities for Nonvolatile Memory Systems in Extreme-Scale High Performance Computing

For extreme-scale high performance computing systems, system-wide power consumption has been identified as one of the key constraints moving forward, where the DRAM main memory systems account for about 30-50% of a node's overall power consumption. Moreover, as the benefits of device scaling for DRAM memory slow, it will become increasingly difficult to keep memory […]

Jan, 21

Global finite element matrix construction based on a CPU-GPU implementation

The finite element method (FEM) involves several computational steps to solve a particular problem numerically, and much effort has been directed at accelerating the solution stage of the linear system of equations. However, the finite element matrix construction, which is also time-consuming for unstructured meshes, has been less investigated. The generation of the global […]

CUDA

Jan, 19

A Survey of Architectural Techniques For Improving Cache Power Efficiency

Modern processors use increasingly large on-chip caches. With each CMOS technology generation, the leakage energy consumption of these caches has also increased significantly. For this reason, cache power management has become a crucial research issue in modern processor design. To address this challenge and also meet the goals of sustainable computing, researchers […]

Jan, 19

Harnessing Aspect Oriented Programming on GPU: Application to Warp-Level Parallelism (WLP)

Stochastic simulations involve multiple replications in order to build confidence intervals for their results, and Designs of Experiments (DOEs) to explore their parameter sets. In this paper, we propose Warp-Level Parallelism (WLP), a GPU-enabled solution to compute Multiple Replications In Parallel (MRIP) on GPUs (Graphics Processing Units). GPUs are intrinsically tuned to efficiently process the […]

CUDA

Jan, 19

Accelerating Mahout on heterogeneous clusters using HadoopCL

MapReduce is a programming model capable of processing massive data in parallel across hundreds of computing nodes in a cluster. It hides many of the complicated details of parallel computing and provides a straightforward interface for programmers to adapt their algorithms to improve productivity. Many MapReduce-based applications have utilized the power of this model, including […]

OpenCL
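
The MapReduce programming model described in the entry above can be illustrated with a small in-process word count. This is a single-machine sketch of the map, shuffle, and reduce phases only; it does not use Hadoop, Mahout, or HadoopCL, and every function name here is hypothetical.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user-supplied mapper to each record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# User code: only these two functions are specific to the word-count problem.
def word_count_mapper(line):
    for word in line.split():
        yield word.lower(), 1

def sum_reducer(key, values):
    return sum(values)

def run_word_count(lines):
    pairs = map_phase(lines, word_count_mapper)
    return reduce_phase(shuffle(pairs), sum_reducer)

counts = run_word_count(["GPU cluster", "gpu node"])
print(counts)
```

The programmer writes only the mapper and the reducer; in a real cluster framework, the partitioning, scheduling, and data movement in between are handled by the runtime.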

Jan, 19

A fast marching method based back projection algorithm for photoacoustic tomography in heterogeneous media

This paper presents a numerical study on a fast marching method based back projection reconstruction algorithm for photoacoustic tomography in heterogeneous media. Transcranial imaging is used here as a case study. To correct for the phase aberration from the heterogeneity (i.e., skull), the fast marching method is adopted to compute the phase delay based on […]

CUDA

Jan, 19

Hybrid Multicore Algorithms for Some Semi-Numerical Applications and Graphs

The computing industry has undergone several paradigm shifts in the last few decades. Fueled by the need for faster computing, larger data, and real-time processing, parallel computing has emerged as one of the dominant paradigms. Motivated by the success achieved in distributed computing models and the limitations faced by single core processors, parallel […]

CUDA, OpenCL

Jan, 19

Indexing of Spatiotemporal Trajectories for Efficient Distance Threshold Similarity Searches on the GPU

Applications in many domains search moving object trajectory databases. The distance threshold search finds all trajectories within a given distance of a query trajectory. We develop three GPU distance threshold search implementations that use indexing techniques significantly different from those used in CPU implementations. We determine experimentally under which conditions each approach performs well using […]

OpenCL

Jan, 16

CURFIL: Random Forests for Image Labeling on GPU

Random forests are popular classifiers for computer vision tasks such as image labeling or object detection. Learning random forests on large datasets, however, is computationally demanding. Slow learning impedes model selection and scientific research on image features. We present an open-source implementation that significantly accelerates both random forest learning and prediction for image labeling of […]

CUDA