Posts
Sep, 23
Simple optimizations for an applicative array language for graphics processors
Graphics processors (GPUs) are highly parallel devices that promise high performance, and they are now flexible enough to be used for general-purpose computing. A programming language based on implicitly data-parallel collective array operations can permit high-level, effective programming of GPUs. I describe three optimizations for such a language: automatic use of GPU shared memory cache, […]
Sep, 23
Mathematical limits of parallel computation for embedded systems
Embedded systems are designed to perform a specific set of tasks, and are frequently found in mobile, power-constrained environments. There is growing interest in the use of parallel computation as a means to increase performance while reducing power consumption. In this paper, we highlight fundamental limits to what can and cannot be improved by parallel […]
Sep, 23
HHT-based time-frequency analysis method for biomedical signal applications
The Fourier transform, the wavelet transform, and the Hilbert-Huang transform (HHT) can be used to analyze the frequency characteristics of linear and stationary signals, the time-frequency features of linear and non-stationary signals, and the time-frequency features of non-linear and non-stationary signals, respectively [1-6]. HHT is a combination of empirical mode decomposition (EMD) and Hilbert spectral analysis. EMD uses the […]
Sep, 23
The International Exascale Software Project roadmap
Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a […]
Sep, 23
Compact data structure and scalable algorithms for the sparse grid technique
The sparse grid discretization technique enables a compressed representation of higher-dimensional functions. In its original form, it relies heavily on recursion and complex data structures, thus being far from well-suited for GPUs. In this paper, we describe optimizations that enable us to implement compression and decompression, the crucial sparse grid algorithms for our application, on […]
Sep, 23
Colored stochastic shadow maps
This paper extends the stochastic transparency algorithm that models partial coverage to also model wavelength-varying transmission. It then applies this to the problem of casting shadows between any combination of opaque, colored transmissive, and partially covered (i.e., α-matted) surfaces in a manner compatible with existing hardware shadow mapping techniques. Colored Stochastic Shadow Maps have a […]
Sep, 23
Unstructured grid applications on GPU: performance analysis and improvement
Performance of applications running on GPUs is mainly affected by hardware occupancy and global memory latency. Scientific applications that rely on analysis using unstructured grids could benefit from the high-performance capabilities provided by GPUs; however, their memory access patterns and algorithms limit the potential benefits. In this paper we analyze the algorithm for unstructured […]
Sep, 23
Orchestration by approximation: mapping stream programs onto multicore architectures
We present a novel 2-approximation algorithm for deploying stream graphs on multicore computers and a stream graph transformation that eliminates bottlenecks. The key technical insight is a data rate transfer model that enables the computation of a "closed form", i.e., the data rate transfer function of an actor depending on the arrival rate of the […]
Sep, 23
Quantifying NUMA and contention effects in multi-GPU systems
As system architects strive for increased density and power efficiency, the traditional compute node is being augmented with an increasing number of graphics processing units (GPUs). The integration of multiple GPUs per node introduces complex performance phenomena including non-uniform memory access (NUMA) and contention for shared system resources. Utilizing the Keeneland system, this paper quantifies […]
Sep, 22
Register packing for cyclic reduction: a case study
We generalize a method for avoiding GPU shared communication when dealing with a downsweep pattern. We apply this generalization to Cyclic Reduction, a tridiagonal solver with this pattern. Previously, Cyclic Reduction suffered poor performance when compared to other tridiagonal solvers on the GPU due to performance issues stemming from shared-memory bandwidth bottlenecks and step-efficiency. We […]
Sep, 22
On-the-fly elimination of dynamic irregularities for GPU computing
The power-efficient massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed. But it remains an open question how to […]
Sep, 22
Reducing branch divergence in GPU programs
Branch divergence has a significant impact on the performance of GPU programs. We propose two novel software-based optimizations, called iteration delaying and branch distribution, that aim to reduce branch divergence. Iteration delaying targets a divergent branch enclosed by a loop within a kernel. It improves performance by executing loop iterations that take the same branch […]