Posts
Apr, 27
The Case for Higher Computational Density in the Memory-Bound FDTD Method within Multicore Environments
It is argued here that more accurate but more compute-intensive alternatives to certain computational methods, although deemed too inefficient and wasteful when implemented in serial codes, can be more efficient and cost-effective when implemented in parallel codes designed to run on today’s multicore and many-core environments. This argument is most germane to methods […]
Apr, 27
Matrix Multiplication with CUDA – A basic introduction to the CUDA programming model
We use the example of Matrix Multiplication to introduce the basics of GPU computing in the CUDA environment. It is assumed that the student is familiar with C programming, but no other background is assumed. The goal of this module is to show the student how to offload parallel computations to the graphics card, when […]
Apr, 25
High-performance sparse matrix-vector multiplication on GPUs for structured grid computations
In this paper, we address efficient sparse matrix-vector multiplication for matrices arising from structured grid problems with high degrees of freedom at each grid node. Sparse matrix-vector multiplication is a critical step in the iterative solution of sparse linear systems of equations arising in the solution of partial differential equations using uniform grids for discretization. […]
Apr, 25
Radio Astronomy Beam Forming on Many-Core Architectures
Traditional radio telescopes use large steel dishes to observe radio sources. The largest radio telescope in the world, LOFAR, uses tens of thousands of fixed, omnidirectional antennas instead, a novel design that promises ground-breaking research in astronomy. Where traditional telescopes use custom-built hardware, LOFAR uses software to do signal processing in real time. This leads […]
Apr, 25
An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs
This paper presents a novel work-distribution strategy for GPUs that efficiently convolves radio-telescope data onto a grid, one of the most time-consuming processing steps in creating a sky image. Unlike existing work-distribution strategies, this strategy keeps the number of device-memory accesses low without incurring the overhead of sorting or searching within telescope data. Performance measurements […]
Apr, 25
Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures
As parallel execution platforms continue to proliferate, there is a growing need for real-time introspection tools that provide insight into platform behavior for performance debugging and correctness checks, and that drive effective resource management schemes. To address this need, we present the Lynx dynamic instrumentation system. Lynx provides the capability to write instrumentation routines that are […]
Apr, 25
Polymer Field-Theory Simulations on Graphics Processing Units
We report the first CUDA graphics-processing-unit (GPU) implementation of the polymer field-theoretic simulation framework for determining fully fluctuating expectation values of equilibrium properties for periodic and select aperiodic polymer systems. Our implementation is suitable both for self-consistent field theory (mean-field) solutions of the field equations, and for fully fluctuating simulations using the complex Langevin approach. […]
Apr, 25
The Bones Source-to-Source Compiler Manual
Recent advances in multi-core and many-core processors require programmers to exploit an increasing amount of parallelism in their applications. Data-parallel languages such as CUDA and OpenCL make it possible to take advantage of such processors, but still require a large amount of effort from programmers. To address the challenge of parallel programming, we introduce […]
Apr, 25
Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing
The Parboil benchmarks are a set of throughput computing applications useful for studying the performance of throughput computing architecture and compilers. The name comes from the culinary term for a partial cooking process, which represents our belief that useful throughput computing benchmarks must be "cooked", or preselected to implement a scalable algorithm with fine-grained parallel […]
Apr, 25
WebCL for Hardware-Accelerated Web Applications
Mobile devices, such as smartphones and tablets, now run full-featured browsers capable of handling rich media and web content. The emergence of HTML5 makes the browser an ever more attractive platform for application developers. In addition, improvements in JavaScript engines are further shrinking the performance gap between native applications, typically written in C and […]
Apr, 25
Comparison of Different Parallel Implementations of the 2+1-Dimensional KPZ Model and the 3-Dimensional KMC Model
We show that efficient simulations of the Kardar-Parisi-Zhang interface growth in 2 + 1 dimensions and of the 3-dimensional Kinetic Monte Carlo of thermally activated diffusion can be realized both on GPUs and modern CPUs. In this article we present results of different implementations on GPUs using CUDA and OpenCL and also on CPUs using […]
Apr, 25
Paraiso: An Automated Tuning Framework for Explicit Solvers of Partial Differential Equations
We propose Paraiso, a domain-specific language embedded in the functional programming language Haskell, for automated tuning of explicit solvers of partial differential equations (PDEs) on GPUs as well as multicore CPUs. In Paraiso, one can describe PDE-solving algorithms succinctly using tensor equation notation. Hydrodynamic properties, interpolation methods and other building blocks are described in […]