Polynomial systems occur in many areas of science and engineering. Unlike general nonlinear systems, the algebraic structure enables to compute all solutions of a polynomial system. We describe our massive parallel predictor-corrector algorithms to track many solution paths of a polynomial homotopy. The data parallelism that provides the speedups stems from the evaluation and differentiation […]

May 10, 2015 by hgpu

In non-linear systems, where explicit analytic solutions usually can’t be found, visualisation is a powerful approach which can give insights into the dynamical behaviour of models; it is also crucial for teaching this area of mathematics. In this paper we present new software, Fireflies, which exploits the power of graphical processing unit (GPU) computing to […]

May 7, 2015 by hgpu

Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of […]

April 27, 2015 by hgpu

General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM implementation has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the resulting sparse […]

April 23, 2015 by hgpu

The paper presents investigations on the implementation and performance of the finite element numerical integration algorithm for first order approximations and three processor architectures, popular in scientific computing, classical CPU, Intel Xeon Phi and NVIDIA Kepler GPU. A unifying programming model and portable OpenCL implementation is considered for all architectures. Variations of the algorithm due […]

April 8, 2015 by hgpu

Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the […]

March 20, 2015 by hgpu

MILJS is a collection of state-of-the-art, platform-independent, scalable, fast JavaScript libraries for matrix calculation and machine learning. Our core library offering a matrix calculation is called Sushi, which exhibits far better performance than any other leading machine learning libraries written in JavaScript. Especially, our matrix multiplication is 177 times faster than the fastest JavaScript benchmark. […]

March 2, 2015 by hgpu

This document describes an implementation in C of a set of randomized algorithms for computing partial Singular Value Decompositions (SVDs). The techniques largely follow the prescriptions in the article "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," N. Halko, P.G. Martinsson, J. Tropp, SIAM Review, 53(2), 2011, pp. 217-288, but with some […]

February 22, 2015 by hgpu

Numerical continuation methods apply predictor-corrector algorithms to track a solution path defined by a family of systems, the so-called homotopy. The systems we consider are defined by polynomials in several variables with complex coefficients. For larger dimensions and degrees, the numerical conditioning worsens and hardware double precision becomes often insufficient to reach the end of […]

January 28, 2015 by hgpu

MatConvNet is an implementation of Convolutional Neural Networks (CNNs) for MATLAB. The toolbox is designed with an emphasis on simplicity and flexibility. It exposes the building blocks of CNNs as easy-to-use MATLAB functions, providing routines for computing linear convolutions with filter banks, feature pooling, and many more. In this manner, MatConvNet allows fast prototyping of […]

December 16, 2014 by hgpu

Lattice Quantum Chromodynamics simulations typically spend most of the runtime in inversions of the Fermion Matrix. This part is therefore frequently optimized for various HPC architectures. Here we compare the performance of the Intel Xeon Phi to current Kepler-based NVIDIA Tesla GPUs running a conjugate gradient solver. By exposing more parallelism to the accelerator through […]

November 18, 2014 by hgpu

We present a library that provides optimized implementations for deep learning primitives. Deep learning workloads are computationally intensive, and optimizing the kernels of deep learning workloads is difficult and time-consuming. As parallel architectures evolve, kernels must be reoptimized for new processors, which makes maintaining codebases difficult over time. Similar issues have long been addressed in […]

October 8, 2014 by hgpu