Posts
Jan, 23
Taming the complexities of the C11 and OpenCL memory models
We study how the C11 memory model can be simplified and how it can be extended. Our first contribution is to propose a mild strengthening of the model that enables the rules pertaining to sequentially-consistent (SC) operations to be significantly simplified. We eliminate one of the total orders that candidate executions must range over, leading […]
Jan, 23
Can Portability Improve Performance? An Empirical Study of Parallel Graph Analytics
Due to increasingly large datasets, graph analytics – traversals, all-pairs shortest path computations, centrality measures, etc. – are becoming the focus of high-performance computing (HPC). Because HPC is currently dominated by many-core architectures (both CPUs and GPUs), new graph processing solutions have to be defined to efficiently use such computing resources. Prior work focuses on […]
Jan, 23
Gunrock: A High-Performance Graph Processing Library on the GPU
For large-scale graph analytics on the GPU, the irregularity of data access and control flow and the complexity of programming GPUs have been two significant challenges for developing a programmable high-performance graph library. "Gunrock", our graph-processing system, uses a high-level bulk-synchronous abstraction with traversal and computation steps, designed specifically for the GPU. Gunrock couples high […]
Jan, 21
Reproducible and Accurate Matrix Multiplication for GPU Accelerators
Due to non-associativity of floating-point operations and dynamic scheduling on parallel architectures, getting a bitwise reproducible floating-point result for multiple executions of the same code on different or even similar parallel architectures is challenging. In this paper, we address the problem of reproducibility in the context of matrix multiplication and propose an algorithm that yields […]
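The non-associativity mentioned in this excerpt can be seen in a few lines. The sketch below is only an illustration in plain Python with IEEE 754 doubles, not the algorithm from the paper; the helper `sum_in_order` and the sample values are hypothetical. It shows that summing the same numbers in a different order, as a dynamic parallel scheduler might, can change the result.

```python
# Floating-point addition is not associative: the grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # rounds after a + b first
right = a + (b + c)   # rounds after b + c first
print(left == right)  # False on IEEE 754 doubles


def sum_in_order(values, order):
    """Hypothetical helper: accumulate values in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# The same four numbers, accumulated in two different orders.
values = [1e16, 1.0, -1e16, 1.0]
forward = sum_in_order(values, [0, 1, 2, 3])    # the first 1.0 is absorbed by 1e16
reordered = sum_in_order(values, [0, 2, 1, 3])  # the large terms cancel first
print(forward, reordered)
```

Because the accumulation order on a parallel machine usually depends on scheduling, two runs of the same reduction can differ in exactly this way.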
Jan, 21
GPU concurrency: Weak behaviours and programming assumptions
Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software. To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent […]
Jan, 21
A Novel Implementation of QuickHull Algorithm on the GPU
We present a novel GPU-accelerated implementation of the QuickHull algorithm for calculating convex hulls of planar point sets. We also describe a practical solution that demonstrates how to efficiently implement a typical divide-and-conquer algorithm on the GPU. We make heavy use of the parallel primitives provided by the Thrust library, such as the parallel segmented scan for […]
Jan, 21
Opportunities for Nonvolatile Memory Systems in Extreme-Scale High Performance Computing
For extreme-scale high performance computing systems, system-wide power consumption has been identified as one of the key constraints moving forward, with DRAM main memory systems accounting for about 30-50% of a node's overall power consumption. Moreover, as the benefits of device scaling for DRAM memory slow, it will become increasingly difficult to keep memory […]
Jan, 21
Global finite element matrix construction based on a CPU-GPU implementation
The finite element method (FEM) involves several computational steps to numerically solve a particular problem, and many efforts have been directed at accelerating the solution stage of the linear system of equations. However, the finite element matrix construction, which is also time-consuming for unstructured meshes, has been less investigated. The generation of the global […]
Jan, 19
A Survey of Architectural Techniques For Improving Cache Power Efficiency
Modern processors use increasingly large on-chip caches. Also, with each CMOS technology generation, there has been a significant increase in their leakage energy consumption. For this reason, cache power management has become a crucial research issue in modern processor design. To address this challenge and also meet the goals of sustainable computing, researchers […]
Jan, 19
Harnessing Aspect Oriented Programming on GPU: Application to Warp-Level Parallelism (WLP)
Stochastic simulations involve multiple replications in order to build confidence intervals for their results, and Designs of Experiments (DOEs) to explore their parameter sets. In this paper, we propose Warp-Level Parallelism (WLP), a GPU-enabled solution to compute Multiple Replications In Parallel (MRIP) on GPUs (Graphics Processing Units). GPUs are intrinsically tuned to efficiently process the […]
Jan, 19
Accelerating Mahout on heterogeneous clusters using HadoopCL
MapReduce is a programming model capable of processing massive data in parallel across hundreds of computing nodes in a cluster. It hides many of the complicated details of parallel computing and provides a straightforward interface for programmers to adapt their algorithms to improve productivity. Many MapReduce-based applications have utilized the power of this model, including […]
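The programming model described in this excerpt can be sketched as three phases: map, shuffle, and reduce. The following is a minimal single-process illustration in plain Python; it does not use the Hadoop, HadoopCL, or Mahout APIs, and every function name in it is hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document could be mapped on a different node;
    # here the phases simply run one after another in a single process.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["gpu cluster gpu", "cluster node"])
print(counts)
```

The programmer writes only the map and reduce functions; distribution, grouping, and fault handling are what the framework hides.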
Jan, 19
A fast marching method based back projection algorithm for photoacoustic tomography in heterogeneous media
This paper presents a numerical study on a fast marching method based back projection reconstruction algorithm for photoacoustic tomography in heterogeneous media. Transcranial imaging is used here as a case study. To correct for the phase aberration from the heterogeneity (i.e., skull), the fast marching method is adopted to compute the phase delay based on […]