The generic matrix-matrix multiplication (GEMM) is arguably the most popular computational kernel of the 20th century. Yet, surprisingly, no common methodology for evaluating GEMM performance has been established over the many decades of using GEMM for comparing architectures, compilers and ninja-class programmers. We introduce GEMMbench, a framework and methodology for evaluating performance of GEMM implementations. […]

November 13, 2015 by hgpu

This paper presents a new major release of the program FIESTA (Feynman Integral Evaluation by a Sector decomposiTion Approach). The new release is mainly aimed at optimal performance at large scales when one is increasing the number of sampling points in order to reduce the uncertainty estimates. The release now supports graphical processor units (GPU) […]

November 12, 2015 by hgpu

We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a […]

November 4, 2015 by hgpu

The efficiency of boundary element methods depends crucially on the time required for setting up the stiffness matrix. The far-field part of the matrix can be approximated by compression schemes like the fast multipole method or $mathcal{H}$-matrix techniques. The near-field part is typically approximated by special quadrature rules like the Sauter-Schwab technique that can handle […]

October 29, 2015 by hgpu

Astrophysical direct $N$-body methods have been one of the first production algorithms to be implemented using NVIDIA’s CUDA architecture. Now, almost seven years later, the GPU is the most used accelerator device in astronomy for simulating stellar systems. In this paper we present the implementation of the Sapporo2 $N$-body library, which allows researchers to use […]

October 16, 2015 by hgpu

Study of general purpose computation by GPU (Graphics Processing Unit) can improve the image processing capability of micro-computer system. This paper studies the parallelism of the different stages of decimation in time radix 2 FFT algorithm, designs the butterfly and scramble kernels and implements 2D FFT on GPU. The experiment result demonstrates the validity and […]

June 1, 2015 by hgpu

Polynomial systems occur in many areas of science and engineering. Unlike general nonlinear systems, the algebraic structure enables to compute all solutions of a polynomial system. We describe our massive parallel predictor-corrector algorithms to track many solution paths of a polynomial homotopy. The data parallelism that provides the speedups stems from the evaluation and differentiation […]

May 10, 2015 by hgpu

In non-linear systems, where explicit analytic solutions usually can’t be found, visualisation is a powerful approach which can give insights into the dynamical behaviour of models; it is also crucial for teaching this area of mathematics. In this paper we present new software, Fireflies, which exploits the power of graphical processing unit (GPU) computing to […]

May 7, 2015 by hgpu

Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of […]

April 27, 2015 by hgpu

General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM implementation has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the resulting sparse […]

April 23, 2015 by hgpu

The paper presents investigations on the implementation and performance of the finite element numerical integration algorithm for first order approximations and three processor architectures, popular in scientific computing, classical CPU, Intel Xeon Phi and NVIDIA Kepler GPU. A unifying programming model and portable OpenCL implementation is considered for all architectures. Variations of the algorithm due […]

April 8, 2015 by hgpu

Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the […]

March 20, 2015 by hgpu