Posts
Nov, 3
On the programmability of multi-GPU computing systems
Multi-GPU systems are widely used in High Performance Computing environments to accelerate scientific computations. This trend is expected to continue as integrated GPUs will be introduced to processors used in multi-socket servers and servers will pack a higher number of GPUs per node. GPUs are currently connected to the system through the PCI Express interconnect, […]
Nov, 3
Exploring Optimisations for the Local Assembly phase of Finite Element Methods on GPUs
Finite Element Methods (FEM) are ubiquitous in science and engineering where they are used in fields as diverse as structural analysis, ocean modeling and bioengineering. FEM allow us to find approximate solutions to a system of partial differential equations over an unstructured mesh. The first phase of solving a FEM problem, local assembly, involves computing […]
Nov, 3
A Framework for Transparent Execution of Massively-Parallel Applications on CUDA and OpenCL
We present a novel framework for simultaneous development on different massively parallel platforms. Currently, our framework supports CUDA and OpenCL, but it can be easily adapted to other programming languages. The main idea is to provide an easy-to-use abstraction layer that encapsulates calls to the user's own parallel device code as well as to library functions. […]
Nov, 3
Structural Agnostic SpMV: Adapting CSR-Adaptive for Irregular Matrices
Sparse matrix vector multiplication (SpMV) is an important linear algebra primitive. Recent research has focused on improving the performance of SpMV on GPUs when using compressed sparse row (CSR), the most frequently used matrix storage format on CPUs. Efficient CSR-based SpMV obviates the need for other GPU-specific storage formats, thereby saving runtime and storage overheads. […]
Oct, 31
Energy-Efficient Execution of Data-Parallel Applications on Heterogeneous Mobile Platforms
State-of-the-art mobile systems-on-chip (SoCs) include heterogeneity in various forms for accelerated and energy-efficient execution of a diverse range of applications. Modern SoCs now include programmable cores such as CPUs and GPUs with very different functionality. SoCs also integrate performance-heterogeneous cores with different power-performance characteristics but the same instruction-set architecture, such as ARM big.LITTLE. […]
Oct, 31
Estimation of numerical reproducibility on CPU and GPU
Differences in simulation results may be observed from one architecture to another or even inside the same architecture. Such reproducibility failures are often due to different rounding errors generated by different orders in the sequence of arithmetic operations. Reproducibility problems are particularly noticeable on new computing architectures such as multicore processors or GPUs (Graphics Processing […]
Oct, 31
Parallelization of Encryption and Hashing Algorithm Using GPU
With the development of GPGPU (general-purpose computing on graphics processing units), more and more computing problems are solved by exploiting the parallelism of the GPU (Graphics Processing Unit). CUDA (Compute Unified Device Architecture) is a framework that makes GPGPU more accessible and easier to learn for the general population of programmers. This is […]
Oct, 31
Investigation of General-Purpose Computing on Graphics Processing Units and its Application to the Finite Element Analysis of Electromagnetic Problems
In this dissertation, the hardware and API architectures of GPUs are investigated, and the corresponding acceleration techniques are applied to the traditional frequency-domain finite element method (FEM), the element-level time-domain methods, and the nonlinear discontinuous Galerkin method. First, the assembly and the solution phases of the FEM are parallelized and mapped onto the granular […]
Oct, 31
A general tridiagonal solver for coprocessors: Adapting g-Spike for the Intel Xeon Phi
Manycores like the Intel Xeon Phi and graphics processing units like the NVIDIA Tesla series are prime examples of systems for accelerating applications that run on current CPU multicores. It is therefore of interest to build fast, reliable linear system solvers targeting these architectures. Moreover, it is of interest to conduct cross comparisons between algorithmic […]
Oct, 31
Asynchronous Parallel Computing Algorithm implemented in 1D Heat Equation with CUDA
In this note, we present a stability and performance analysis of an asynchronous parallel computing algorithm applied to the 1D heat equation with CUDA. The primary objective of this note is to disseminate the asynchronous parallel computing algorithm by providing CUDA code for fast and easy implementation. We show that the simulations carried out on […]
Oct, 29
Padding Free Bank Conflict Resolution for CUDA-Based Matrix Transpose Algorithm
Advances in Graphics Processing Unit (GPU) technology and the introduction of the CUDA programming model facilitate the development of new solutions for sparse and dense linear algebra solvers. Matrix transpose is an important linear algebra procedure that has a deep impact on various computational science and engineering applications. Several factors hinder the expected performance of large matrix transpose […]
Oct, 29
CLOP: A Multi-stage Compiler to Seamlessly Embed Heterogeneous Code
Heterogeneous programming complicates software development. We present CLOP, a platform that embeds code targeting heterogeneous compute devices in a convenient and clean way, allowing unobstructed data flow between the host code and the devices, reducing the amount of source code by an order of magnitude. The CLOP compiler uses the standard facilities of the D […]