A version of the H-LU factorization is introduced, based on the individual computational tasks occurring during the block-wise H-LU factorization. The dependencies between these tasks form a directed acyclic graph, which is used for efficient scheduling on parallel systems. The algorithm is especially suited for many-core processors and shows a much improved parallel scaling behavior […]

March 14, 2014 by hgpu
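The task-DAG idea in the teaser above can be illustrated with an ordinary dense blocked LU rather than the paper's H-LU. A minimal Python sketch (all task names — `getrf`, `trsm_row`, `trsm_col`, `gemm` — are illustrative, not taken from the paper) builds the classic dependency graph for an `nb × nb` block grid and derives one valid execution order with the standard-library `TopologicalSorter`:

```python
from graphlib import TopologicalSorter

def blocked_lu_task_dag(nb):
    """Dependency DAG for a right-looking blocked LU on an nb x nb block grid.

    Tasks (illustrative names):
      ('getrf', k)        -- factor diagonal block (k, k)
      ('trsm_row', k, j)  -- triangular solve for block (k, j)
      ('trsm_col', k, i)  -- triangular solve for block (i, k)
      ('gemm', k, i, j)   -- trailing update of block (i, j) at step k
    Maps each task to the set of tasks it must wait for.
    """
    deps = {}
    for k in range(nb):
        g = ('getrf', k)
        # the diagonal block must first receive its step-(k-1) update
        deps[g] = {('gemm', k - 1, k, k)} if k else set()
        for j in range(k + 1, nb):
            deps[('trsm_row', k, j)] = {g} | (
                {('gemm', k - 1, k, j)} if k else set())
        for i in range(k + 1, nb):
            deps[('trsm_col', k, i)] = {g} | (
                {('gemm', k - 1, i, k)} if k else set())
        for i in range(k + 1, nb):
            for j in range(k + 1, nb):
                t = ('gemm', k, i, j)
                deps[t] = {('trsm_col', k, i), ('trsm_row', k, j)}
                if k:
                    deps[t].add(('gemm', k - 1, i, j))
    return deps

deps = blocked_lu_task_dag(3)
schedule = list(TopologicalSorter(deps).static_order())  # one valid serial order
```

A runtime scheduler can instead dispatch any task whose predecessors have completed, which is where the parallelism the abstract describes comes from.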

We study the impact of non-uniform memory access (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular we illustrate how an appropriate placement of the threads and memory on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We […]

March 12, 2014 by hgpu

We present a performance analysis of a parallel implementation of both conjugate gradient and preconditioned conjugate gradient solvers using graphic processing units with CUDA parallel programming model. The solvers were optimized for a fast solution of sparse systems of equations arising from Finite Element Analysis (FEA) of electromagnetic phenomena. The preconditioners were Incomplete Cholesky factorization […]

March 12, 2014 by hgpu
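The conjugate gradient kernel that the GPU solvers above accelerate is itself only a few lines. A minimal pure-Python sketch (dense matrix, no preconditioner, no CUDA — just the algorithm, with illustrative names):

```python
def cg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient for a symmetric positive-definite system A x = b.

    A is a list of rows, b a list; returns the solution as a list.
    """
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = b[:]          # residual b - A x, with x = 0
    p = r[:]          # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)          # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol ** 2:
            break
        # new direction, A-conjugate to the previous ones
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

On a GPU, the matvec and the vector updates become sparse kernels and the dot products become reductions; a preconditioned variant additionally solves M z = r each iteration, which is where the Incomplete Cholesky factorization mentioned in the abstract enters.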

We present an efficient and scalable programming model for the development of linear algebra in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given using algorithms for solving linear systems as the basis – the LU, QR, and […]

February 28, 2014 by hgpu

Iterative solvers for sparse linear systems often benefit from using preconditioners. While there are implementations for many iterative methods that leverage the computing power of accelerators, porting the latest developments in preconditioners to accelerators has been challenging. In this paper we develop a self-adaptive multi-elimination preconditioner for graphics processing units (GPUs). The preconditioner is based […]

February 23, 2014 by hgpu

The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of the computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic […]

January 30, 2014 by hgpu

Nowadays, the paradigm of parallel computing is changing. CUDA is now a popular programming model for general purpose computations on GPUs, and a great number of applications have been ported to CUDA, obtaining speedups of orders of magnitude compared to optimized CPU implementations. Hybrid approaches that combine the message passing model with the shared memory model […]

December 29, 2013 by hgpu

We present a new BSSRDF representation for editing measured anisotropic heterogeneous translucent materials, such as veined marble, jade, and artificial stones with light-blocking discontinuities. Our work is inspired by the SubEdit representation introduced in [1]. Our main contribution is to improve the accuracy of the approximation while keeping it compact and efficient for editing. We decompose the […]

December 19, 2013 by hgpu

Many of today’s complex scientific applications now require a vast amount of computational power. General purpose graphics processing units (GPGPUs) enable researchers in a variety of fields to benefit from the computational power of all the cores available inside graphics cards. Understand the Benefits of Using GPUs for Many Scientific Applications: Designing Scientific Applications on […]

November 13, 2013 by hgpu

This paper presents preliminary performance comparisons of parallel applications developed natively for the Intel Xeon Phi accelerator using three different parallel programming environments and their associated runtime systems. We compare Intel OpenMP, Intel CilkPlus and XKaapi together on the same benchmark suite and we provide comparisons between an Intel Xeon Phi coprocessor and a Sandy […]

November 11, 2013 by hgpu

The training of an SVM can be viewed as a Convex Quadratic Programming (CQP) problem, which becomes difficult to solve when dealing with large-scale data sets. Traditional methods for SVM training, such as Sequential Minimal Optimization (SMO), solve a sequence of small-scale sub-problems, which costs a large amount of […]

October 18, 2013 by hgpu
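The small sub-problems SMO solves are two-variable ones, and each has a closed-form answer. A minimal sketch of that analytic pair update (linear kernel only; function and variable names are illustrative, not the paper's implementation), applied below to a trivial two-point problem whose optimum alpha1 = alpha2 = 0.5 is known in closed form:

```python
def smo_pair_update(X, y, alpha, b, i, j, C=1.0):
    """One SMO step: jointly optimize alpha[i], alpha[j] of the SVM dual
    under the box constraint 0 <= alpha <= C and sum(alpha * y) = const.
    Linear kernel only; returns the updated multiplier list."""
    K = lambda p, q: sum(a * c for a, c in zip(X[p], X[q]))
    f = lambda p: sum(alpha[m] * y[m] * K(m, p) for m in range(len(y))) + b
    Ei, Ej = f(i) - y[i], f(j) - y[j]        # prediction errors
    eta = K(i, i) + K(j, j) - 2.0 * K(i, j)  # curvature along the pair
    if eta <= 0:
        return alpha
    # feasible interval for alpha[j] from the box + equality constraints
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    aj = min(H, max(L, alpha[j] + y[j] * (Ei - Ej) / eta))  # clipped optimum
    ai = alpha[i] + y[i] * y[j] * (alpha[j] - aj)           # keep equality
    alpha = alpha[:]
    alpha[i], alpha[j] = ai, aj
    return alpha

# Two points x = 1 (label +1) and x = -1 (label -1): one update suffices.
alphas = smo_pair_update([[1.0], [-1.0]], [1, -1], [0.0, 0.0], 0.0, 0, 1)
```

A full trainer loops this update over heuristically chosen pairs until the KKT conditions hold, which is the sequential bottleneck the abstract alludes to.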

We have developed a GPU-based parallel linear solver package. When solving matrices from reservoir simulation, the parallel solvers are much more efficient than CPU-based linear solvers. However, efforts should be made to improve the setup phase of the domain decomposition, the factorization of ILUT, and the parallelism of the block ILUT preconditioner.

September 14, 2013 by hgpu
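The ILUT and block ILU preconditioners mentioned above build on incomplete factorization; the simplest variant, ILU(0), keeps fill only where the matrix is already nonzero and fits in a few lines. A sketch (dense list-of-lists storage for clarity; names are illustrative and not from the package):

```python
def ilu0(A):
    """ILU(0): LU factorization restricted to the sparsity pattern of A.

    Returns (L, U) with L unit lower triangular; entries outside the
    pattern of A are dropped instead of filled in.
    """
    n = len(A)
    a = [row[:] for row in A]
    nz = {(i, j) for i in range(n) for j in range(n) if A[i][j] != 0.0}
    for i in range(1, n):
        for k in range(i):
            if (i, k) not in nz:
                continue
            a[i][k] /= a[k][k]            # multiplier L[i][k]
            for j in range(k + 1, n):
                if (i, j) in nz:          # update only within the pattern
                    a[i][j] -= a[i][k] * a[k][j]
    L = [[a[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[a[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U

# Tridiagonal example: LU of a tridiagonal matrix produces no fill,
# so here ILU(0) coincides with the exact factorization.
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
L, U = ilu0(A)
```

In a preconditioned iteration the pair (L, U) is then applied via two triangular solves per step; the paper's setup-phase concern is that building such factors in parallel is much harder than applying them.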