Posts
Jul, 31
The Promises of Hybrid Hexagonal/Classical Tiling for GPU
Time-tiling is necessary for efficient execution of iterative stencil computations. But the usual hyper-rectangular tiles cannot be used because of positive/negative dependence distances along the stencil’s spatial dimensions. Several prior efforts have addressed this issue. However, known techniques trade enhanced data reuse for other causes of inefficiency, such as unbalanced parallelism, redundant computations, or increased […]
Jul, 31
Opportunities for Heterogeneous CPU/GPU Task Scheduling
It is common to exploit the co-processors of modern computer systems to speed up computations that were traditionally done on the CPU. While this is already very common for computer graphics and scientific applications, there is no reason why it cannot be extended to many other kinds of applications. In this paper we study the […]
Jul, 31
GPU-based Streaming Algorithm for High-Resolution Cloth Simulation
We present a GPU-based streaming algorithm to perform high-resolution and accurate cloth simulation. We map all the components of the cloth simulation pipeline, including time integration, collision detection, collision response, and velocity updating, to GPU-based kernels and data structures. Our algorithm performs intra-object and inter-object collisions, handles contacts and friction, and is able to accurately simulate […]
Jul, 31
Fast Computation of Computer-generated Hologram Using Xeon Phi Coprocessors
Parallel computing is an effective way to accelerate computer-generated hologram (CGH) calculation. In this paper, we implemented various CGH algorithms on Intel Xeon Phi coprocessors. In the best case, the CGH calculations ran 12 times faster than on a CPU.
Jul, 31
Graphics Processing Unit acceleration of the Random Phase Approximation in the projector augmented wave method
The Random Phase Approximation (RPA) for correlation energy in the grid-based projector augmented wave (GPAW) code is accelerated by porting to the Graphics Processing Unit (GPU) architecture. The acceleration is achieved by grouping independent vectors/matrices and transforming the implementation from being memory bound to being computation/latency bound. With this approach, both the CPU and GPU […]
Jul, 30
Image Processing with CUDA
This thesis tests the power of parallel computing on the GPU against the massive computations needed to process large images. The GPU has long been used to accelerate 3D applications. With the advent of high-level programmable interfaces, programming for the GPU has been simplified, and the GPU is being used to accelerate […]
Jul, 30
Domain Specific Languages for High Performance Computing
High Performance Computing (HPC) relies completely on complex parallel, heterogeneous architectures and distributed systems, which are hard and error-prone to exploit, even for HPC specialists. Ever deeper knowledge of runtime systems, dependency tracking, memory transaction optimization, and many other techniques is required to produce high-quality software capable of exploiting every single […]
Jul, 30
Counting and Occurrence Sort for GPUs using an Embedded Language
This paper investigates two sorting algorithms: counting sort and a variation, occurrence sort, which also removes duplicate elements, and examines their suitability for running on the GPU. The duplicate-removing variation turns out to have a natural functional, data-parallel implementation, which makes it particularly interesting for GPUs. The algorithms are implemented in Obsidian, a high-level […]
Jul, 30
Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general […]
Jul, 30
Scalable Dense Linear Algebra on Heterogeneous Hardware
The design of systems exceeding 1 Pflop/s and the push toward 1 Eflop/s forced a dramatic shift in hardware design. Various physical and engineering constraints resulted in the introduction of massive parallelism and functional hybridization with the use of accelerator units. This paradigm change brings about a serious challenge for application developers, as the management of multicore […]
Jul, 29
LU Factorization with Partial Pivoting for a Multicore System with Accelerators
LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disproportion […]
Jul, 29
Progressive High-Quality Response Surfaces for Visually Guided Sensitivity Analysis
In this paper we present a technique for efficient, high-quality, progressive response surface prediction from multidimensional input samples. We utilize kriging interpolation to estimate a response surface which minimizes the expectation value and variance of the prediction error. High computational efficiency is achieved by employing parallel […]